
The World is the Test

Bruegel's Netherlandish Proverbs — a world of agents, each acting on their own logic

There's a classic simulation in disaster science. A theater fire. Scientists modeled individual agents, gave them simple behaviors, and placed them in a room with limited exits. The finding that reshaped the field: most casualties didn't come from the fire. They came from the crowd. The crush at the exits. Panic turning coordinated individuals into a dangerous collective.

You could study each agent in isolation and never predict what happened when you put them in a world together. I first encountered this during my undergrad research on agent-based modeling in online communities, studying how individual behavior transforms the moment it's embedded in a social environment. The principle was clear, but the tools were limited. Classical ABM agents were rule-based: if-then logic, state machines, probability distributions. They could model crowd dynamics. They couldn't simulate a hesitant buyer weighing a purchase, or a job candidate who grows quiet under pressure. The agents were brittle, and the worlds they inhabited were brittle with them.

Large language models changed that. An LLM-powered agent reasons, adapts, and responds to context in ways that feel believably human. The agents don't need to be brittle anymore. Which means we can finally build the kinds of uncertain, dynamic worlds that evaluation has always needed.


Most LLM evaluation today still works like studying people in isolation. Even the sophisticated versions (LLM-as-judge with carefully designed rubrics, calibrated scoring, multi-dimensional assessment) still operate on a single frozen interaction. Give the model an input. Score the output. Move on. This is genuinely good work for measuring whether a system can summarize a document or answer a factual question. It breaks completely when the system you're evaluating exists in relationship with an unpredictable user.

The systems we're actually deploying (sales agents, coaching tools, customer support bots) don't operate in a vacuum. They encounter a user who goes quiet after the third exchange. A user who deflects every substantive question. A user who knows exactly what they need and just wants to be left alone. The agent's quality isn't an intrinsic property. It's a response to the world it's placed in.

You can't evaluate that with flat input-output pairs. You have to build the world.
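The contrast can be made concrete with a minimal sketch. Everything here is illustrative: the `agent`, `persona`, and `judge` callables are toy stand-ins for real components, not any particular framework's API. The point is the shape of the loop, not the stubs.

```python
def flat_eval(agent, dataset, judge):
    """Flat evaluation: score frozen input-output pairs, one turn each."""
    return [judge([(x, agent(x, []))]) for x in dataset]

def world_eval(agent, persona, judge, max_turns=8):
    """World-based evaluation: unroll a conversation against a simulated
    user, then judge the whole trajectory rather than a single exchange."""
    history, message = [], persona.opening()
    while message is not None and len(history) < max_turns:
        reply = agent(message, history)
        history.append((message, reply))
        message = persona.react(reply, history)  # the world talks back
    return judge(history)

# --- toy stand-ins so the sketch runs ---
class QuietPersona:
    """A user who goes quiet unless the agent keeps asking questions."""
    def opening(self):
        return "hi"
    def react(self, reply, history):
        if len(history) >= 3:
            return None  # disengages after three exchanges
        return "ok" if "?" in reply else "..."

def echo_agent(message, history):
    return f"Tell me more about '{message}'?"

def judge(history):
    # Toy rubric: fraction of turns where the user stayed engaged.
    return sum(1 for m, _ in history if m != "...") / max(len(history), 1)

score = world_eval(echo_agent, QuietPersona(), judge)
```

Notice what the flat version can never see: the persona's decision to disengage depends on what the agent did two turns earlier. That dependency only exists once there is a world.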

The Evaluation Shift: flat evaluation vs. world-based evaluation


This insight is converging from several directions at once.

Simile AI, founded by the team behind the Generative Agents research at Stanford, is building simulation engines for human behavior at population scale. Their foundational work placed twenty-five LLM-powered agents in a simulated town where they autonomously formed relationships, coordinated meetups, and produced emergent social behavior, none of it scripted. Their latest research simulates over a thousand real individuals with striking fidelity. The thesis isn't prediction. It's causal reasoning: if we made this decision, what would happen? If we had made a different one, what would have happened instead?

The evaluation research community arrived at a parallel conclusion. The Agent-as-a-Judge framework, presented at ICML 2025, showed that agentic systems need agentic evaluators. Not just scoring final outputs, but examining the entire process of how an agent navigated a task. In code generation experiments, it matched human expert agreement at roughly 90%, compared to 70% for standard LLM-as-judge methods. The cost dropped by 97%.

The self-driving industry offered a blueprint years earlier. Waymo logged twenty million real-world miles but ran over twenty billion simulated ones. The edge cases that matter most are too rare to encounter naturally. You have to construct them. This principle is now migrating to AI evaluation: persona-driven simulation at massive scale, stress-testing systems against behaviors that would take months to surface in production.

Different origins, same destination: to evaluate an agent, you have to build the world it will inhabit.


I've tested this in two domains.

The first was an AI interviewer. Static evaluation (prepared questions, expected answers) would have missed the point entirely. The interviewer's job was to adapt: notice when a candidate was struggling, redirect when someone was rambling, step back when someone was doing well. So I built the world instead. Synthetic personas designed to push the system into different adaptation patterns. The evaluation became: did the system respond appropriately to this particular user behavior in this particular moment?
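A sketch of what such a persona grid might look like. The fields and the keyword-matching judge are assumptions for illustration, not the actual system's schema; a real judge would be an LLM scoring the transcript against the expected adaptation.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    behavior: str             # how the simulated candidate acts
    expected_adaptation: str  # what a good interviewer should do in response

PERSONAS = [
    Persona("struggler", "gives fragmentary answers, hesitates often",
            "simplify the question, offer a scaffold"),
    Persona("rambler", "answers everything with long tangents",
            "redirect politely, time-box follow-ups"),
    Persona("strong", "answers crisply and completely",
            "step back, raise the difficulty"),
]

def check_adaptation(transcript, persona):
    """Judge one trajectory: did the interviewer show the adaptation this
    persona was designed to elicit? (Stub: checks for the first keyword.)"""
    keyword = persona.expected_adaptation.split(",")[0].split()[0]
    return keyword in transcript

# Each persona is one row of the evaluation: the world is the test.
results = {p.name: check_adaptation("simplify then redirect", p)
           for p in PERSONAS}
```

Each row encodes both a pressure and a success criterion, so designing the personas is inseparable from designing the evaluation.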

The second was a customer intelligence system: sales enablement designed to predict how different buyer personas respond to different approaches. Again, the system's value wasn't in a single correct recommendation. It was in navigating uncertainty across diverse behaviors: the skeptic who needs data, the champion who needs confidence, the executive who needs brevity.
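The same idea, viewed as a grid rather than a loop: personas cross approaches, and the system is judged on navigating the whole matrix, not on one answer. The scores below are made-up illustrative values, not measurements.

```python
# Toy response matrix: persona -> approach -> simulated receptiveness.
# Values are invented for illustration only.
PERSONAS = {
    "skeptic":   {"data-heavy": 1.0, "confident pitch": 0.2, "brief summary": 0.4},
    "champion":  {"data-heavy": 0.5, "confident pitch": 1.0, "brief summary": 0.6},
    "executive": {"data-heavy": 0.3, "confident pitch": 0.4, "brief summary": 1.0},
}

def best_approach(persona):
    """Pick the approach this simulated persona responds to best."""
    row = PERSONAS[persona]
    return max(row, key=row.get)
```

A flat evaluation would ask whether one recommendation was correct; the matrix asks whether the system's choices track the persona it is facing.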

Same pattern both times. The design choices in building that world (which personas, which pressures, which failure modes) weren't preparation for the evaluation. They were the evaluation itself.


Other fields have known this for decades. Disaster modelers simulate the fire and watch what emerges. Epidemiologists build population models where second-order effects reveal themselves. Urban planners simulate migration flows.

Agent-based modeling taught me this principle years ago. AI evaluation is arriving at it now. Not because the idea is new, but because the tools to build believable simulated worlds finally exist. LLMs turned out to be what Andrej Karpathy described: simulation engines trained on a diverse population. The question was never whether to simulate. It was whether our agents were capable enough to make simulation meaningful.

They are now. The interesting work ahead isn't building better judges. It's building better worlds.

