The Two Roles of Simulators in RL
In Reinforcement Learning (RL) applications intended for deployment in a real-world system—be it a physical robot, a power grid, or a live web service—simulators (of physics, or of human biology and psychology) are a central tool for operating in reality and play a fundamental role in RL research. That said, I’ve recently come to realize that simulators can play two distinct roles. The choice between them has subtle yet important implications, and the tendency to conflate them is a primary source of muddled discussions in the literature and can be counterproductive for practical projects.
The goal of this post is to elaborate on the two roles—which I’d call “digital twin” vs. “vibe simulator”, for lack of better terms—and their implications, hoping to bring some clarity to the subject. The tl;dr is the following table, which I will explain in the rest of the post.
|  | “Digital twin” | “Vibe simulator” |
|---|---|---|
| Fidelity | High (near 1-to-1 replica). | Reasonable (captures essential dynamics). |
| What’s learned from the simulator | Final policy deployed in the real system. | Algorithm and pipeline design. |
| Data used to train the final policy | Simulation data. | Real-system data. |
| Real-system constraints (e.g., offline data collection) | Real-system constraints are circumvented. | Real-system constraints must be replicated in simulation. |
| What’s evaluated in research | Training algorithm. | Entire pipeline, including data collection design, the training algorithm, and off-policy evaluation (OPE) methods. |
| Example | Sim2real robotics. | Validating an offline RL pipeline before applying it to a production system. |
I was recently discussing an RL project for optimizing a wireless network. My suggestion was to first build a simulator to test out the RL pipeline. My collaborator immediately pointed out the immense difficulty: modeling the physics of beam wave reflections would be a massive, perhaps insurmountable, research project in itself. After some back-and-forth we realized that we were talking past each other: he was envisioning a “digital twin” of the real system, where the simulator must have high fidelity and enable sim2real transfer of the policy learned in simulation. I, on the other hand, was thinking of a much more crude “vibe” simulator. Its purpose is not to produce a final policy, but to act as a cheap, fast sandbox to verify that the entire algorithmic pipeline is sound. This includes everything from the data collection protocol to the learning algorithm to the final evaluation method. The policy trained in this toy simulator is disposable; the validated pipeline is the valuable asset that we then deploy on the real system.
Having had this conversation, and realizing that there are many subtle points I wanted to convey to the students who plan to build the simulator and design the application pipeline, I figure it’s best to write them down in a somewhat organized manner. To start, the different roles of simulators are intimately related to the kind of application paradigm we adopt:
Paradigm 1: Direct Learning in the Real System
This is the baseline: the agent takes actions, alters states, and learns from the feedback in the real system itself. These actions have real consequences, so safety and cost considerations often impose constraints on how the agent is allowed to behave. The constraints can range from limited exploration in online RL, to passively observing how the system currently operates without any action intervention as in offline RL, to a more complex procedure of designing a data collection policy and then performing offline RL on the collected data.
There are two typical challenges with this approach. First, constraints on data collection can make it difficult to learn a good policy: in an offline setting, the fixed dataset may lack the exploratory coverage needed. Second, when data is not already available and must be collected proactively, the iteration cycle can be painfully slow and tedious. In domains like clinical trials, active data collection easily takes months. Even in faster-moving areas like online recommendation, deploying new policies often involves approvals and procedures that defy the rapid iteration common in ML research. This already significant challenge is further amplified by the fragility of RL algorithms: outside well-studied benchmarks, successfully training an agent on a novel problem, even in a fast simulator, can take numerous tries. Transplanting this trial-and-error process into a tedious, real-world cycle makes the problem infinitely harder.
Paradigm 2: Circumventing Reality with a Digital Twin
A standard approach to overcoming both hurdles of direct interaction is sim2real: if we can build a high-fidelity simulator, we can circumvent real-world constraints (e.g., run millions of online interactions in simulation even if the real system only allows offline data) and iterate rapidly. The final policy is trained on simulation data and then deployed. Of course, this approach is limited to problems where building an accurate simulator is feasible, which is not always the case, especially when the problem involves complex physical processes or the psychological and biological aspects of human behavior. (*1)
As a remark, the above is the most naive and straightforward paradigm for sim2real, and there are variations such as domain randomization, partial simulation combined with exogenous processes represented by data (the balloon), and online finetuning. In all such cases, the simulator’s role is similar.
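To give a flavor of one such variation, here is a sketch of domain randomization on a made-up one-dimensional system. Nothing below mirrors a real simulator or library API; the function names, parameter ranges, and dynamics are all illustrative. The idea is simply that each episode resamples the simulator’s physical parameters, so a policy trained across episodes cannot overfit to a single instantiation of the dynamics.

```python
# Illustrative domain randomization: resample simulator parameters per episode
# so the trained policy must be robust to model mismatch at deployment time.
import random

def make_randomized_sim(rng):
    """Sample simulator parameters (e.g., friction, sensor noise) per episode."""
    friction = rng.uniform(0.8, 1.2)     # nominal value 1.0, randomized +/-20%
    noise_scale = rng.uniform(0.0, 0.1)  # per-step dynamics noise
    def step(state, action):
        next_state = state + friction * action + rng.gauss(0.0, noise_scale)
        reward = -abs(next_state)        # objective: drive the state toward 0
        return next_state, reward
    return step

rng = random.Random(0)
for episode in range(3):
    step = make_randomized_sim(rng)      # fresh dynamics for every episode
    state = 1.0
    for _ in range(10):
        # A simple proportional controller stands in for the learned policy.
        state, reward = step(state, action=-0.5 * state)
```

A real instance would plug an RL training loop into the inner loop; the structural point is only that the environment, not just the initial state, is resampled each episode.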
Paradigm 3: Accelerating Paradigm 1 with a Vibe Simulator
This brings us to the third paradigm, which uses a “vibe simulator” as a precursor to direct real-system interaction in Paradigm 1, when a “digital twin” is impossible or too costly to build. The goal here is to build a cheap simulator that is just good enough to debug the entire end-to-end pipeline. That is, if we plan to collect X data from the real system and run algorithm Y on the real data, we first collect X data from the simulator (that is, under whatever constraints we would face in the real system), run algorithm Y on the simulation data, and test the resulting policy in simulation.
If everything works well, we at least have some confidence that the pipeline is worth trying on the real system. There is no guarantee of success, but at least we can rule out options that never had a chance to work. When I teach RL, students regularly apply standard algorithms to problems of interest to them, and it is very common that the first version of working code comes only after weeks (if not months!) of trial and error. I think it’s fair to say that you really don’t want to learn such lessons in slow, tedious real-system cycles, and want to do as much preparation as possible in simulation, even if it means the overhead of building the simulator.
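To make the recipe concrete, here is a toy end-to-end sketch of the “collect X, run Y, evaluate” loop. Everything is a hypothetical stand-in: the vibe simulator is a ten-state chain, the replicated real-system constraint is that we only get a fixed offline dataset from an exploratory logging policy, and “algorithm Y” is a bare-bones tabular fitted Q-iteration.

```python
# Validate an offline RL pipeline inside a vibe simulator before touching
# the real system. All components are illustrative stand-ins.
import random

random.seed(0)

def vibe_sim_step(state, action):
    """Toy stochastic dynamics on states 0..9; reward for reaching state 0."""
    next_state = max(0, min(9, state + action + random.choice([-1, 0, 1])))
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

def collect_offline_data(behavior_policy, n_episodes=200, horizon=20):
    """Step 1: replicate the real-system constraint -- a fixed offline dataset."""
    data = []
    for _ in range(n_episodes):
        s = random.randrange(10)
        for _ in range(horizon):
            a = behavior_policy(s)
            s2, r = vibe_sim_step(s, a)
            data.append((s, a, r, s2))
            s = s2
    return data

def fit_q(data, gamma=0.9, lr=0.1, sweeps=50):
    """Step 2: 'algorithm Y' -- tabular fitted Q-iteration on the fixed data."""
    q = {(s, a): 0.0 for s in range(10) for a in (-1, 1)}
    for _ in range(sweeps):
        for (s, a, r, s2) in data:
            target = r + gamma * max(q[(s2, -1)], q[(s2, 1)])
            q[(s, a)] += lr * (target - q[(s, a)])
    return q

def evaluate(policy, n_episodes=100, horizon=20):
    """Step 3: test the candidate policy in the same simulator."""
    total = 0.0
    for _ in range(n_episodes):
        s = random.randrange(10)
        for _ in range(horizon):
            a = policy(s)
            s, r = vibe_sim_step(s, a)
            total += r
    return total / n_episodes

behavior = lambda s: random.choice([-1, 1])   # exploratory logging policy
data = collect_offline_data(behavior)
q = fit_q(data)
greedy = lambda s: max((-1, 1), key=lambda a: q[(s, a)])
# Sanity check: the offline-trained policy should beat the logging policy.
print(evaluate(greedy), evaluate(behavior))
```

The policy learned here is disposable; what carries over to the real system is the validated structure: the data collection protocol, the training code, and the evaluation step.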
Task-specific Simulators vs. Research Benchmarks
Careful readers may have already noticed that this “vibe simulator” approach is nothing new; in fact, it is what RL research has been doing for years with benchmarks like Atari and MuJoCo. (After all, no one expects a policy trained on Pendulum-v1 to directly control a real, physical pendulum.) This naturally raises the question: if we already have these standard testbeds, why bother building a new, task-specific one? While these benchmarks are invaluable for developing general-purpose algorithms, there is often a large gap between them and a specific, novel application, and getting an off-the-shelf algorithm to work on a new problem domain is a notorious challenge. A custom, problem-specific “vibe simulator” acts as a crucial intermediate step: it allows for rapid iteration on a problem that is much closer to the target system than any generic benchmark is. Furthermore, there is a growing concern that the RL research community may have inadvertently overfit algorithms and their standard hyperparameters to the specific dynamics of popular benchmarks, making them less robust when applied to new domains. A custom simulator helps mitigate this risk.
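A task-specific vibe simulator also need not be a large engineering effort. The sketch below is a minimal example in plain Python, exposing the familiar reset/step interface; the queue-like dynamics are entirely made up for illustration. The practical payoff is that training code written against this interface can later be pointed at the real system with few changes.

```python
# A minimal task-specific vibe simulator with a Gym-style reset/step interface.
# Dynamics are fabricated for illustration only.
import random

class VibeSimEnv:
    """Toy load-management environment: keep the load near a target level."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.target = 5.0
        self.load = 0.0

    def reset(self):
        self.load = self.rng.uniform(0.0, 10.0)
        return self.load                       # observation

    def step(self, action):
        arrivals = self.rng.uniform(0.0, 2.0)  # random demand
        self.load = max(0.0, self.load + arrivals - action)
        reward = -abs(self.load - self.target)
        done = False                           # continuing task
        return self.load, reward, done, {}

env = VibeSimEnv()
obs = env.reset()
total = 0.0
for _ in range(100):
    action = 1.5 if obs > env.target else 0.5  # crude heuristic baseline
    obs, reward, done, info = env.step(action)
    total += reward
```

Even a simulator this crude lets you debug data logging, reward bookkeeping, and evaluation code before any real-system interaction.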
Clarifying Confusion in the Literature
Failing to distinguish between these two roles of simulators is, in my opinion, a primary cause of confusion and debates in the research community where participants seem to talk past one another. In the controversy around Go-Explore, the algorithm solved the difficult Montezuma’s Revenge by leveraging the emulator’s ability to reset to previously visited states, which sparked a debate about whether using resets was “cheating.” I also remember similar debates when sample efficiency was brought up as a performance metric independent of computational cost. To me, the answer depends entirely on which application paradigm the research aims to inform and, consequently, on the role of the simulator in that paradigm. If we aim to inform applications where simulators are accurate digital twins, then state resets can be a valid tool. (*2) If the goal is to develop an algorithm for direct deployment in a real-world system, then resets are an invalid technique for that context. (*3)
At the risk of sounding overly cynical, it’s interesting to observe people’s mindsets about this issue and the inconsistencies therein. Many primarily care about simulations alone, mostly because the applications they care about are simulations themselves (e.g., developing strong game-playing agents) or come with high-fidelity simulators (or soon will). There is also a significant portion of the community that views RL as the path towards “general human intelligence” and rejects perception inputs that are not human-comprehensible (e.g., RAM state). Then there is a relatively small group (including myself) interested in problems where digital twins do not seem feasible in the foreseeable future (success in such problems is also the scarcest). (*4) My complaint is that participants in these debates rarely make their implicit assumptions clear. Perhaps worse, many seem not to have considered the question at all, leading to contradictory behaviors in different contexts: some work exclusively on simulation-only problems yet firmly reject resets just because they “do not feel right”; some publish extensively on “RL on real-world X” yet never study any component of the pipeline other than the training algorithm. As a community, I hope we can all take a moment when writing our papers to ask: “Do I think of the simulator in my paper as a digital twin or as a vibe simulator, for the ultimate application I aim to inform?” Being more explicit about the answer will perhaps bring us closer to deploying RL in a wider range of problems.
Footnote Remarks
(*1) One interesting direction is to leverage recent advances in generative models to create good simulators for more and more problems, especially those where accurate simulation was thought to be impossible. Recent examples include leveraging diffusion models to build physics simulators of complex environments, and using LLMs to model human user interactions. In this recent paper we make the point that building such simulators can also be viewed as part of the RL problem (instead of thinking of RL as something we do after the simulator is given), which leads to interesting statistical problems of model selection.
(*2) State resetting can be tricky in complex simulators, especially those with latent variables. I touched on some related issues in this old paper, and am cooking something related; stay tuned!
(*3) The Go-Explore paper did have a similar discussion after the debate sparked by their initial draft and suggested workarounds for applying their method in non-resettable environments.
(*4) … which means this blogpost may involve some armchair theorizing :). If you have practical experience and disagree with my points, I absolutely want to hear (see below)!
Feedback and Comments
This is my first technical blogpost and my website is not set up to host comments, but you are welcome to leave comments below this tweet.
Update: giscus-based comments are now enabled!
This post was last edited on July 18, 2025.