
Co-evolving AI systems

Opening

In some of my recent reading, I found myself returning to an idea from Thinking, Fast and Slow: that intelligence isn’t just about reasoning well, but about bringing the right memory or context to mind at the right time. Much of what we call judgment comes down to what gets surfaced — and what stays buried — in a given moment.

That idea stayed with me over the past few months as I worked on a small passion project. I started wondering what it would mean to build an AI system that felt like an extension of how I think — not a stateless assistant, but something that could maintain context over time, build memory deliberately, and adapt as my work, goals, and constraints evolved. Not a better model, necessarily, but a system that grows alongside its user.

As I went deeper, this stopped feeling like a purely personal question. I began recognizing the same patterns from my earlier work building AI-powered systems in legal technology at Relativity. In legal workflows, the model is rarely the limiting factor. The harder problems live elsewhere: handling nuance, tracking evolving case context, managing exceptions, and knowing when a piece of information matters now versus when it can safely be ignored. An answer can be fluent and still wrong if the system fails to surface the right context at the right time.

Around the same time, reading more work on context management and adaptive AI systems — including research and writing from groups like Anthropic — gave language to something I had been circling intuitively. What I was trying to build wasn’t just an application layered on top of a large model. It was a different kind of system altogether: a co-evolving AI system — one whose intelligence emerges over time, shaped by how it manages context, memory, evaluation, and adaptation at inference time as it interacts with real data and real workflows.


What I mean by a co-evolving AI system

When I talk about a co-evolving AI system, I’m not referring to a model that is continuously retrained. I’m referring to a system whose behavior changes over time without changing its weights — through how it manages context, memory, feedback, and evaluation at inference time.

Most AI systems today are effectively static at deployment. They may call powerful models, retrieve relevant documents, or apply clever prompting strategies, but the system itself does not meaningfully change as it is used. Each interaction is treated as largely independent, bounded by a context window and optimized for a single response. Any adaptation, if it happens at all, is deferred to future training runs.

A co-evolving system behaves differently. It treats usage as a signal. Over time, it learns what information tends to matter, which details are foundational versus transient, where nuance is critical, and where simplification is acceptable. This learning doesn’t happen through gradients, but through system-level decisions: what gets written to memory, what decays, how context is constructed for the next interaction, and how feedback — explicit or implicit — shapes future behavior.
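To make "usage as a signal" concrete, here is a rough sketch in Python. The names (MemoryEntry, UsageSignalMemory, the decay and reinforcement constants) are hypothetical, not any particular implementation: entries earn salience when they prove useful and lose it when they don't, so behavior shifts without any weight update.

```python
from dataclasses import dataclass, field
import time


@dataclass
class MemoryEntry:
    """A single remembered fact, with a usage-driven salience score."""
    text: str
    salience: float = 1.0
    last_used: float = field(default_factory=time.time)


class UsageSignalMemory:
    """Illustrative only: the system adapts via bookkeeping, not gradients."""

    def __init__(self, decay: float = 0.98, reinforce: float = 0.3):
        self.entries: list[MemoryEntry] = []
        self.decay = decay          # applied to every entry each interaction
        self.reinforce = reinforce  # added when an entry actually helped

    def write(self, text: str) -> None:
        self.entries.append(MemoryEntry(text))

    def end_of_interaction(self, used_texts: set[str]) -> None:
        """Called once per interaction: reinforce what helped, decay the rest."""
        for entry in self.entries:
            entry.salience *= self.decay
            if entry.text in used_texts:
                entry.salience += self.reinforce
                entry.last_used = time.time()
        # Forget entries whose salience has decayed below a floor.
        self.entries = [e for e in self.entries if e.salience > 0.05]
```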

Crucially, the system evolves with its environment. As data distributions shift, workflows change, or user goals evolve, the system adapts at inference time rather than waiting for retraining cycles. Intelligence, in this framing, emerges not just from the underlying model, but from the ongoing interaction between the model, the memory it is given, the context it is placed in, and the signals it receives from the world.

In this sense, co-evolving systems are less like deployed artifacts and more like long-running processes. They succeed or fail not on individual responses, but on how well they maintain alignment over time.


Context, memory, and evaluation as first-class design problems

Once you start thinking in terms of co-evolving systems, a pattern becomes hard to ignore. The hardest problems are no longer inside the model. They sit around it — in how context is constructed, how memory is managed, and how evaluation happens over time.

Context is the most visible constraint. Language models operate within finite context windows, forcing systems to decide what information is included and what is left out on every interaction. These choices are rarely neutral. Over-weighting recency can erase foundational details. Over-aggressive summarization can remove the nuance that made information important in the first place. In long-running systems, context selection becomes a continuous optimization problem, not a preprocessing step.
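One way to picture context selection as an ongoing optimization: score candidate items by a blend of recency and importance, then fill a token budget greedily. The function below is a simplified sketch with assumed fields (tokens, age, importance); a real system would tune both the weights and the scoring itself.

```python
def build_context(items, budget_tokens, w_recency=0.4, w_importance=0.6):
    """Select memory items for the next prompt under a token budget.

    Each item is a dict with 'text', 'tokens', 'age' (interactions since
    it was written), and 'importance' (0..1, e.g. set by a salience policy).
    Pure recency would sort by age alone; blending in importance keeps
    foundational details from being pushed out.
    """
    def score(item):
        recency = 1.0 / (1.0 + item["age"])
        return w_recency * recency + w_importance * item["importance"]

    selected, used = [], 0
    for item in sorted(items, key=score, reverse=True):
        if used + item["tokens"] <= budget_tokens:
            selected.append(item)
            used += item["tokens"]
    return [item["text"] for item in selected]
```

Even this toy version makes the trade-off explicit: raising w_recency makes the system feel more responsive, at the cost of slowly squeezing out foundational details.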

Memory introduces a different set of trade-offs. What should persist across interactions? What should decay? What happens when new information conflicts with older assumptions? Most systems either store too little — treating each interaction as disposable — or store too much, accumulating noise without a clear notion of salience. In practice, memory policies matter as much as retrieval quality. They determine whether a system feels consistent over time or slowly drifts away from what once made it useful.
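As a small illustration of a retention policy, with hypothetical categories and half-lives: foundational facts persist far longer than transient, task-local details, and anything whose weight falls below a floor is dropped. The specific numbers are placeholders, not recommendations.

```python
from enum import Enum


class Kind(Enum):
    FOUNDATIONAL = "foundational"  # goals, constraints, standing preferences
    TRANSIENT = "transient"        # task-local details

# Hypothetical half-lives, measured in interactions.
HALF_LIFE = {Kind.FOUNDATIONAL: 500, Kind.TRANSIENT: 10}


def retention_weight(kind: Kind, age_in_interactions: int) -> float:
    """Exponential decay with a per-kind half-life."""
    return 0.5 ** (age_in_interactions / HALF_LIFE[kind])


def should_keep(kind: Kind, age_in_interactions: int, floor: float = 0.1) -> bool:
    """Drop an entry once its retention weight falls below the floor."""
    return retention_weight(kind, age_in_interactions) >= floor
```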

Evaluation is the quietest, and often the weakest, part of the loop. Many systems rely on offline benchmarks or single-turn metrics that say little about long-horizon behavior. A response can score well while pushing the system further off course. Without feedback loops that account for time, repetition, correction, and reversal, systems optimize locally and drift globally.
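One hedged sketch of what longer-horizon evaluation could look like: instead of scoring each turn in isolation, track signals that only show up across many interactions, such as how often users correct the system, reverse an earlier answer, or ask the same thing again. The class and window size below are illustrative, not a proposed standard.

```python
from collections import deque


class LongHorizonMonitor:
    """Track signals that only appear over many interactions.

    A per-turn quality score can look fine while these rates creep upward,
    which is exactly the kind of drift single-turn metrics miss.
    """

    def __init__(self, window: int = 200):
        self.events = deque(maxlen=window)  # one record per interaction

    def record(self, corrected: bool, reversed_earlier: bool, repeated: bool) -> None:
        self.events.append((corrected, reversed_earlier, repeated))

    def drift_report(self) -> dict:
        n = max(len(self.events), 1)
        return {
            "correction_rate": sum(e[0] for e in self.events) / n,
            "reversal_rate": sum(e[1] for e in self.events) / n,
            "repetition_rate": sum(e[2] for e in self.events) / n,
        }
```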

What connects these three is that they all operate at inference time. They shape system behavior without touching model weights — yet they are the primary levers through which long-running systems adapt to real environments. Treating context, memory, and evaluation as first-class design problems is what separates systems that merely respond from systems that evolve.


Inference-time adaptation: gradient-free levers that matter

Once context, memory, and evaluation are treated as first-class concerns, another shift becomes clear. Many of the most effective ways AI systems adapt do not involve updating model weights at all. They happen at inference time, through mechanisms that shape how the model is used rather than how it is trained.

These gradient-free adaptation mechanisms sit directly in the loop where the system encounters real data. Real users. Real edge cases. Retrieval and reranking determine what the model sees. Verification and critique loops decide which outputs are trusted. Inference-time search or best-of-N sampling trades additional computation for robustness. None of these change the underlying model — but all of them change system behavior in meaningful ways.
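Best-of-N sampling with a verifier is perhaps the simplest of these to sketch. In the snippet below, generate and verify are placeholders for whatever sampling and critique steps a system already has; the point is that robustness is bought with extra inference-time compute, not with a weight update.

```python
def best_of_n(prompt, generate, verify, n=5):
    """Trade extra inference-time compute for robustness.

    generate(prompt) returns one candidate answer (e.g. a sampled
    completion); verify(prompt, answer) returns a score from a critique
    step, a checker, or a lightweight reward model. Neither function
    touches model weights; the adaptation lies entirely in how the
    model is used.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(verify(prompt, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score
```

The value of this loop depends heavily on how good verify is; with a weak critique step, extra samples mostly reshuffle noise.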

What makes these mechanisms powerful is not novelty, but placement. When workflows shift or new constraints emerge, inference-time adaptations can respond immediately. They don’t require waiting for retraining cycles or assuming the future will resemble the past.

In long-running systems, these mechanisms compound. Small improvements in retrieval reduce the burden on summarization. Better verification limits noisy memory. Thoughtful memory policies reduce the need for aggressive pruning. Over time, the system doesn’t just produce better answers — it behaves more reliably under change.

This is why systems built on the same models often diverge in practice. The difference is rarely the model itself. It lies in how seriously inference-time adaptation is treated as a driver of intelligence at the system level.


Where co-evolving systems fail

The need for co-evolving systems becomes most visible when systems don’t evolve — when they are deployed into dynamic environments but designed as if the world were static. These failures rarely appear as dramatic breakdowns. They show up slowly, as drift.

One common failure mode is recency bias. Systems that privilege the most recent interactions often feel responsive at first, but gradually lose coherence. Foundational context is pushed out. Long-term goals are overshadowed by short-term signals. The system becomes reactive rather than reflective — correct in the moment, but misaligned over time.

Another failure mode comes from over-compression. Summarization is often treated as a solution to context limits, but aggressive compression removes the very nuance that made information important. Qualifications disappear. Edge cases flatten into averages. The system hasn’t forgotten — it has misremembered.

Memory corruption is subtler, and more dangerous. When new information conflicts with old assumptions, many systems lack a principled way to resolve the tension. Instead, they overwrite. Over time, memory becomes internally inconsistent, and behavior starts to feel unpredictable. Users experience this not as a single failure, but as a gradual erosion of trust.
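As one illustration of what a more principled policy could look like, the sketch below detects a conflict and keeps both beliefs with provenance, letting the newer value win only when its source is at least as trusted. The trust ordering is an assumption for the example, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    key: str            # e.g. "filing_deadline"
    value: str
    observed_at: float  # timestamp supplied by the caller
    source: str         # where this came from: "user", "document", "inference"


def reconcile(existing: Belief, incoming: Belief) -> list[Belief]:
    """Resolve a conflict explicitly instead of silently overwriting.

    The newer belief takes precedence only when its source is at least as
    trusted as the old one; the older belief is retained for audit either way.
    This is one possible policy, not the only one.
    """
    trust = {"user": 3, "document": 2, "inference": 1}
    if incoming.value == existing.value:
        return [incoming]
    if trust[incoming.source] >= trust[existing.source]:
        return [incoming, existing]   # newer first, older kept for audit
    return [existing, incoming]


# Example: a user correction supersedes an earlier inferred value,
# but the old value is retained rather than erased.
old = Belief("filing_deadline", "May 1", observed_at=1.0, source="inference")
new = Belief("filing_deadline", "June 1", observed_at=2.0, source="user")
assert reconcile(old, new)[0].value == "June 1"
```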

Evaluation failures compound all of this. When success is measured turn by turn, systems are blind to long-horizon degradation: individual responses can look healthy even as the overall trajectory moves further off course.

In practice, most failures in long-running AI systems are not model failures. They are system design failures.


A broader convergence

What’s striking is that these failure modes are no longer isolated. They’re increasingly visible across industry and research, even in systems built on the most capable models available. As models improve, the limitations of static system design become harder to ignore.

This has led to a quiet shift in focus. Work on context management, inference-time compute, and adaptive system design — including writing by Sara Hooker and organizations like Anthropic — points to the same conclusion: progress is no longer driven by scaling alone. It’s driven by how systems behave after deployment.

Different groups use different language for this shift, but the underlying idea is consistent. Long-running AI systems must be able to adapt continuously, without relying on constant retraining. Co-evolving systems are not a speculative future direction. They are a practical response to a world that doesn’t stand still.


Closing

Thinking in terms of co-evolving AI systems changes what we optimize for. The goal is no longer the best possible response in isolation, but systems that behave well over time — systems that remain aligned as context accumulates, goals shift, and environments change.

This reframing also shifts responsibility. If intelligence emerges from context selection, memory policies, and evaluation loops, then these are not peripheral engineering details. They are the core design decisions that determine whether a system compounds understanding or quietly drifts away from what matters.

Co-evolving systems demand a different kind of rigor. They force us to reason about forgetting as carefully as remembering, about adaptation without overfitting, and about evaluation that spans weeks or months rather than single interactions.

Humane intelligence, in this sense, is not about making systems sound more human. It’s about building systems that respect continuity — systems that surface the right context at the right time, preserve what matters, and adapt without losing their way.
