Prompt Engineering Didn't Die. It Got Unrolled.

[Figure: sketch of a single prompt unwinding into four concentric feedback loops. Caption: The prompt was always the visible step of an invisible loop.]

A few weeks ago I caught myself running the same prompt four times. I'd tightened the wording each time, fed in a different piece of context each time, watched the answer get incrementally better each time. Then I rewrote the prompt one more time, started a fresh thread, and ran it again. I'd been doing this exact dance for three years.

That was the loop. I was the loop. The whole time.

The current consensus is that prompt engineering is dead, replaced by context engineering, on its way to being replaced by something called harness engineering. That story isn't wrong, but it misses what actually happened. Prompts didn't die. They got unrolled into the loops that were always running underneath them. Four of them, more or less simultaneously, in roughly the last twelve months.

Once you see them, you can't unsee them. And once you can see them, you start designing for them.

What This Looked Like For Me in 2024

I want to make this concrete before I name the loops, because the loops were not theoretical for me. They were a daily tax I paid without knowing I was paying it.

In 2024 I was building a clinical documentation pipeline. The model would take a recorded clinician-patient conversation and produce a structured note. I spent most of my time iterating prompts. I had a doc with maybe forty versions of the same instruction, each one tagged with what I'd noticed wrong with the previous version. When a new failure showed up in production, I'd open the doc, find a similar past failure, copy the fix forward, and try again. I called this "prompt engineering" and I told other people I was good at it.

What I was actually doing was running four separate loops in my head and pretending it was one skill.

I was running a refinement loop every time I tweaked the wording. I was running a workflow loop every time I remembered which version of the prompt went with which kind of recording. I was running a retry loop every time I asked the model to redo a section because the first pass was wrong. And I was running a learning loop every time I read an output and decided whether it had earned a place in my prompt history.

None of those loops were written down. All of them lived in my head, in a fragmented Notion doc, and in the heads of two other engineers who had developed slightly different versions of the same intuitions. When one of us went on vacation the system got worse. When all three of us went on vacation simultaneously the system got dangerous.

The thing that changed in 2026 is not that those four loops became necessary. They were already necessary. The change is that all four of them, for the first time, can be externalized into infrastructure that other people and other models can read.

The Move Most People Make

When someone tells me prompt engineering is dead, what they usually mean is that they bought a vector database, set up retrieval, and called it context engineering. Or they added a self-critique step and called the system agentic. Or they wired up a small eval harness and called it production-ready.

Each of those is a real upgrade. Each one is also one loop out of four. Teams that stop after one move keep wondering why their output plateaus six weeks later. They've moved one piece of work from their head into infrastructure. The other three pieces are still happening in their head, unobserved, unmeasured, and not improving.

The shift is not "add a loop." The shift is that the system that was always running in your head was always made of four loops, and 2026 is the first year you can plausibly externalize all of them. Most of the leverage comes from doing that together, not separately.

Loop 1: The Refinement Loop

This is the loop you ran the most and noticed the least. You wrote a prompt. You read the output. You decided it was almost right. You added a sentence. You swapped a word. You ran it again. You did this eight times, then declared victory. It felt like writing. It was actually a loop.

What changed in the last year is that the loop became durable. Threads now last weeks instead of hours. They compact themselves when they fill up rather than forcing you to start over. You can interrupt one mid-execution to add direction without losing the work that's already underway. You can drop in unedited thoughts — the messy version of what you were going to say — because the messy version turns out to be richer context than the polished one.

Here's the number that should bother you: a model that scores 98 on a clean single-prompt evaluation drops to 64 across a multi-turn run with even modest accumulation. That gap is not a model problem. It is the cost of running the refinement loop without infrastructure. The loop was always there. The cost was always there. You just didn't have a way to see it.

The other number that should bother you: roughly two-thirds of enterprise AI failures in 2025 were attributed to context drift, not context exhaustion. The model didn't run out of room. The room got polluted, the model got confused, and by the time you noticed, the thread was already poisoned. Context drift kills agents long before context limits do, and the only way to see it happening is to instrument the loop and watch.

The fix is not to truncate aggressively when you hit the wall. The fix is to compact proactively at sixty or seventy percent of the window, before drift sets in. What stays in the window after compaction is not the raw conversation. It is a living summary of intent, decisions, and pending work that gets rewritten incrementally as the thread evolves. The full transcript can be filed away as a virtual document the model can retrieve from if a detail turns out to matter later. The active context stays small. The historical context stays accessible. The loop keeps running.
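Here is a minimal sketch of that compaction policy, under loud assumptions: the 0.65 threshold is just a point inside the sixty-to-seventy band, `estimate_tokens` is a crude four-characters-per-token stand-in for a real tokenizer, and `summarize` is a hypothetical callable that in practice would be a model call. None of these names come from a real provider API.

```python
from dataclasses import dataclass, field
from typing import Callable

COMPACT_AT = 0.65  # assumed threshold inside the 60-70% band


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: about four characters per token.
    return (len(text) + 3) // 4


@dataclass
class Thread:
    window_tokens: int
    summary: str = ""                                  # living summary of older turns
    active: list[str] = field(default_factory=list)    # recent turns kept verbatim
    archive: list[str] = field(default_factory=list)   # full transcript, retrievable later

    def used_tokens(self) -> int:
        return estimate_tokens(self.summary) + sum(estimate_tokens(t) for t in self.active)

    def add_turn(self, turn: str, summarize: Callable[[str, list[str]], str]) -> None:
        self.active.append(turn)
        self.archive.append(turn)
        # Compact before the window fills, not when it overflows.
        if self.used_tokens() >= COMPACT_AT * self.window_tokens:
            self.compact(summarize)

    def compact(self, summarize: Callable[[str, list[str]], str], keep_last: int = 2) -> None:
        split = max(0, len(self.active) - keep_last)
        old, self.active = self.active[:split], self.active[split:]
        if old:
            # Rewrite the summary incrementally from the previous summary plus old turns.
            self.summary = summarize(self.summary, old)


def toy_summarize(previous: str, old_turns: list[str]) -> str:
    # Hypothetical summarizer: keeps only each turn's label.
    notes = [previous] if previous else []
    notes += [turn.split(":")[0] for turn in old_turns]
    return ", ".join(notes)


thread = Thread(window_tokens=100)
for i in range(10):
    thread.add_turn(f"turn {i}: " + "x" * 80, toy_summarize)
```

After the ten turns, the active context holds only the last two turns plus a short summary, while `thread.archive` still holds all ten for later retrieval.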

The other underappreciated lever is tool output verbosity. A single tool call that dumps three thousand tokens of JSON the agent didn't ask for will degrade reasoning for the rest of the thread. Filter and truncate tool responses at ingestion, not at compression. The agent only needs the fields it's going to reason about next. Anything else is noise that compounds.
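A rough sketch of filtering at ingestion, assuming the tool returns a flat JSON object and the caller knows which fields the agent will reason about next. The function name, the `wanted` list, and the 200-character cap are illustrative choices, not anyone's real API.

```python
import json


def filter_tool_output(raw_json: str, wanted: list[str], max_chars: int = 200) -> str:
    """Keep only the requested top-level fields and cap long string values."""
    data = json.loads(raw_json)
    kept = {}
    for key in wanted:
        if key not in data:
            continue
        value = data[key]
        if isinstance(value, str) and len(value) > max_chars:
            value = value[:max_chars] + "...[truncated]"
        kept[key] = value
    return json.dumps(kept)


raw = json.dumps({"status": "ok", "id": 7, "debug": "x" * 3000, "body": "y" * 500})
slim = filter_tool_output(raw, wanted=["status", "body"])
```

The point of doing this before the response enters the thread is that the three thousand characters of `debug` never get a chance to compound.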

This used to be code you wrote. As of early 2026 it is increasingly a flag you turn on — the major providers ship native compaction as a server-side primitive now. The thing that was your job last year is becoming part of the runtime. The loop is still there. The work of running it moved.

When I sit down to do real work with a model now, the first decision is not what to prompt. It is where the work is going to live. If the answer is "this thread, then I'll lose it," I'm running the loop in my head again. If the answer is "this durable thread, which already knows what I tried last week, and will compact itself when it gets too long," I'm not.

Loop 2: The Workflow Loop

The second loop is the one you ran with sticky notes and tab groups. You took the output of one model, pasted it into another. You knew which model was good for which job. You knew which file held the doc the model needed for context. You knew the right co-worker to ask when the model got stuck. None of that lived anywhere except your head.

What's emerging in its place is a per-project layout where the model can navigate the same way you do. A flat behavioral contract that tells the model how you want to work. An annotated index of the docs and channels that matter, with a paragraph each explaining what they're for. A directory tree the model can grep and glob through without having to ask which folder things live in.

The first time you do this it feels like overhead. The third time it feels like the cheapest leverage you've ever bought. You've stopped re-explaining your environment every session. The model arrives oriented. Every finished artifact becomes context for the next session, which means each iteration starts from a higher floor than the last one. That is what people mean when they say context compounds. It does not compound automatically. It compounds because you wrote down the things that used to live in your head.

Two patterns have changed how I write these contracts.

The first is the rule that you only commit a rule on the second occurrence of a problem, not the first. The first time something goes wrong it might be a fluke. The second time it is a pattern. If you write a rule for every error you ever notice, the contract drowns in one-off lessons within a month, signal-to-noise collapses, and the model stops being able to find the rules that actually matter. Twice equals pattern. Once equals noise. This sounds like a small heuristic. It is actually the difference between a behavioral contract you can maintain and a graveyard of past panics.
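If you wanted to make the twice-equals-pattern rule mechanical, it could look something like this. `RuleLedger` and its method names are hypothetical, and deciding that two failures are "the same problem" (the `problem_key`) is still a human judgment the sketch does not solve.

```python
from collections import Counter


class RuleLedger:
    """Commit a rule only once the same problem has been seen `threshold` times."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.sightings: Counter[str] = Counter()
        self.rules: dict[str, str] = {}

    def observe(self, problem_key: str, proposed_rule: str) -> bool:
        """Record one sighting; return True only when the rule is committed now."""
        self.sightings[problem_key] += 1
        seen_enough = self.sightings[problem_key] >= self.threshold
        if seen_enough and problem_key not in self.rules:
            self.rules[problem_key] = proposed_rule
            return True
        return False


ledger = RuleLedger()
first = ledger.observe("dosage-units-missing", "Always state dosage units.")
second = ledger.observe("dosage-units-missing", "Always state dosage units.")
```

Here `first` is False (noise) and `second` is True (pattern), so the contract only grows when something repeats.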

The second is the boundary between the contract and the pipeline. If a violation would block a merge in CI, the rule does not belong in the contract. It belongs in CI. Prose rules shape behavior. They do not guarantee it. The model will read your contract and apply it most of the time, which is exactly what you want for things like "prefer this naming pattern" or "explain your changes before making them." It is the wrong tool for "never run a destructive command without confirmation." Anything that has to be enforced gets enforced in code. Anything that just has to be habitual gets written in prose.
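To make the "enforced in code" half concrete, here is a deliberately simplistic guard. The deny-list and the function names are illustrative assumptions, and a check on the first word of a command will miss plenty (`sudo rm`, pipes, shell scripts); it only shows where the rule lives, not how to build a complete sandbox.

```python
import shlex

# Illustrative deny-list; a real one would be longer and project-specific.
DESTRUCTIVE_PROGRAMS = {"rm", "dd", "mkfs", "shutdown"}


def is_destructive(command: str) -> bool:
    """True when the first word of the command is on the deny-list."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in DESTRUCTIVE_PROGRAMS


def guard(command: str, confirmed: bool = False) -> str:
    """Return the command unchanged, or raise if it still needs confirmation."""
    if is_destructive(command) and not confirmed:
        raise PermissionError(f"needs human confirmation: {command!r}")
    return command
```

The naming preference stays in the contract. The `rm` check lives here, where the model reading or not reading the contract makes no difference.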

There is also a layering question most teams miss. Behavioral context wants to live at three different levels: org-wide rules everyone shares, service-level conventions for one codebase, and personal quirks that nobody else needs. Dumping all three into one file creates a hundred-line document where the load-bearing five lines are invisible. Splitting them apart lets each layer stay short and focused. The model walks the directory tree and assembles the right contract for the directory it's working in. This is the same mental model engineers already have for config: you don't put your shell aliases in the company-wide .bashrc. The contract is just a config file the model can read.
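A sketch of that directory walk, assuming each layer lives in a file called `CONTRACT.md` (a made-up name; use whatever your tooling expects) and that broader layers should come first so narrower ones can refine them. A personal layer could be appended the same way from a file outside the repo; that part is left out here.

```python
import tempfile
from pathlib import Path

CONTRACT_NAME = "CONTRACT.md"  # hypothetical file name


def assemble_contract(workdir: Path, root: Path) -> str:
    """Join every contract file from root down to workdir, broadest first."""
    workdir, root = workdir.resolve(), root.resolve()
    chain = [workdir, *workdir.parents]
    if root not in chain:
        raise ValueError(f"{workdir} is not inside {root}")
    chain = chain[: chain.index(root) + 1]
    sections = []
    for directory in reversed(chain):
        contract = directory / CONTRACT_NAME
        if contract.is_file():
            sections.append(contract.read_text(encoding="utf-8").strip())
    return "\n\n".join(sections)


# Toy layout: an org-level file and a service-level file, nothing in src/.
with tempfile.TemporaryDirectory() as tmp:
    org = Path(tmp)
    src = org / "billing-service" / "src"
    src.mkdir(parents=True)
    (org / CONTRACT_NAME).write_text(
        "Org: explain changes before making them.\n", encoding="utf-8")
    (org / "billing-service" / CONTRACT_NAME).write_text(
        "Service: prefer snake_case names.\n", encoding="utf-8")
    merged = assemble_contract(src, org)
```

Working in `src/`, the model would see the org rule followed by the service rule, and nothing from any sibling service.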

I have started judging a team's AI maturity by whether their workflow loop has a name and an owner, or whether it's still distributed across the brains of three senior engineers.

Loop 3: The Retry Loop

The third loop is the simplest. The model gave you an answer. You looked at it. You said "no, do it again." That was the retry loop. You ran it by hand with the regenerate button, or by tweaking the temperature, or by sampling three times and picking the one you liked.

The retry is moving inside the model itself now. Instead of one chain of thought stretched longer and longer until it loses its way, the model spawns multiple shorter chains in parallel and synthesizes them at the end. Single-thread reasoning has a well-known failure mode where an imperfect first step locks the model into a suboptimal path it cannot recover from: the further it walks down the wrong road, the more confidently wrong it gets. Forking into parallel threads adds very little latency on modern hardware because decode is bound by memory bandwidth, not compute, so a batch of sequences can share each pass over the weights. Eight parallel reasoning paths take roughly the same wall-clock time as one.

That last sentence has a billing footnote attached to it that most teams miss. Reasoning models charge you for the tokens you see and the tokens you don't — the internal chain-of-thought tokens are billed at the same rate as output tokens, and depending on the task you can end up paying for five to forty-five times more tokens than you actually receive. Multiply that by N in a best-of-N strategy and the cost math gets serious fast. Parallel sampling is cheap in wall-clock time and expensive in dollars, and the trap is treating those two as if they were the same constraint.
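The arithmetic is worth doing once by hand. In this sketch, `billed_ratio` is billed tokens divided by visible tokens (the five-to-forty-five range above), and the price of 10 dollars per million output tokens is an assumed round number for illustration, not any provider's actual rate.

```python
def best_of_n_cost_usd(
    visible_tokens: int,
    billed_ratio: float,
    n: int,
    usd_per_million_output_tokens: float,
) -> float:
    """Dollar cost of n samples when hidden reasoning tokens bill as output."""
    billed_tokens = visible_tokens * billed_ratio * n
    return billed_tokens * usd_per_million_output_tokens / 1_000_000


PRICE = 10.0  # assumed USD per million output tokens, illustration only

single_plain = best_of_n_cost_usd(500, billed_ratio=1, n=1, usd_per_million_output_tokens=PRICE)
best_of_8_low = best_of_n_cost_usd(500, billed_ratio=5, n=8, usd_per_million_output_tokens=PRICE)
best_of_8_high = best_of_n_cost_usd(500, billed_ratio=45, n=8, usd_per_million_output_tokens=PRICE)
```

Under those assumptions a 500-token answer goes from half a cent for one plain sample to twenty cents at best-of-eight with a 5x ratio, and to a dollar eighty at 45x. The wall-clock time barely moves. The bill moves by a factor of 40 to 360.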

The other thing that took me too long to learn is that best-of-N has a sweet spot, not a slope. Best-of-three to best-of-eight is where most of the quality improvement lives. Going from eight to sixty-four buys you very little for many times the cost, and on some tasks selecting from a larger pool actually makes the final pick worse, because an imperfect verifier gets more chances to be fooled. The verifier becomes the bottleneck before the model does. If you don't have a calibrated way to score candidates, generating more of them just gives you more chances to pick the wrong one with confidence.
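Stripped to its skeleton, best-of-N is a sampler plus a verifier. The sketch below uses a toy task with a checkable answer; `noisy_model` and `exact_verifier` are stand-ins I made up, and a real verifier is rarely this clean, which is exactly the bottleneck described above.

```python
import random
from typing import Callable


def best_of_n(
    generate: Callable[[], str],
    verify: Callable[[str], float],
    n: int,
) -> tuple[str, float]:
    """Sample n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    scored = [(verify(candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy task with a verifiable answer: 17 * 23 = 391.
rng = random.Random(0)


def noisy_model() -> str:
    # Hypothetical model that is sometimes off by one.
    return str(17 * 23 + rng.choice([-1, 0, 0, 1]))


def exact_verifier(answer: str) -> float:
    return 1.0 if answer == "391" else 0.0


answer, score = best_of_n(noisy_model, exact_verifier, n=8)
```

Swap `exact_verifier` for a noisy scorer and raising `n` stops helping: the selection is only as good as the thing doing the selecting.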

This is the loop you used to run with a regenerate button. The model now runs it natively. The practical implication is that your retry policy is no longer a UI thing you do. It is an architectural choice — best-of-N versus a single deep chain, what the verifier is, when to fan out versus when to commit. The retry got promoted from an interaction pattern to a system design decision.

The teams I see do this badly treat parallel sampling as a cost line item rather than a quality lever. The teams I see do this well treat it as the first thing they reach for when a task has a verifiable answer and a non-trivial failure cost. The teams I see do this brilliantly know which tasks deserve N=3, which deserve N=8, and which deserve N=1 with a very precise stopping condition. That is taste, and it is one of the few places where prompt-era intuition transfers cleanly into the new world.

Loop 4: The Learning Loop

The fourth loop is the one most teams still have not externalized, and it is the one that determines whether the other three compound or just spin.

For years this loop was eyeballing. You read outputs. You shipped if they looked fine. You complained on Slack if they didn't. The "did this work? would I trust it next time?" judgment lived in your head, was applied inconsistently, was never written down, and never improved.

The replacement is an eval harness, and the thing most teams misunderstand is that the eval harness is not a quality gate. It is the learning loop. Without it, your system does not compound. Every session starts at zero. You can have brilliant context infrastructure and a beautiful retry policy and durable threads stretching back six months, and if you do not have a way to score outputs against a rubric you can audit and update, the system has no way to get better. It will get exactly as good as the prompts you write today and stop there.

There is a shorter version of this argument. Without an eval harness, every model swap is a debate. With one, every swap is a number. If your team is still arguing about whether the new model is better, you have just told me you do not have evals. You may have something called evals. You do not have the loop.

I treat the eval harness as the first thing I build now, not the last. Even a primitive version closes the loop. The shape I keep coming back to is roughly a hundred cases split across four scenario buckets: happy paths the model should handle effortlessly, recoverable failures where it should notice it's wrong and self-correct, unrecoverable failures where it should refuse or escalate, and adversarial inputs where it should not be fooled. A hundred cases is enough to detect meaningful regressions. It is not enough to ship a model safely on its own — for that you need real production traffic monitoring — but it is the smallest investment that turns the loop from a hand-wave into a system.
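A primitive version really is small. This sketch assumes each case carries its own pass/fail check and reports a pass rate per bucket; the bucket identifiers, `Case`, `run_harness`, and the two-point regression tolerance are all my own illustrative choices, and the two toy cases stand in for the hundred.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

BUCKETS = ("happy_path", "recoverable", "unrecoverable", "adversarial")


@dataclass(frozen=True)
class Case:
    bucket: str
    prompt: str
    passes: Callable[[str], bool]  # rubric check for this one case


def run_harness(system: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Return the pass rate per bucket for one system under test."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case in cases:
        if case.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {case.bucket}")
        totals[case.bucket] += 1
        passed[case.bucket] += int(case.passes(system(case.prompt)))
    return {b: passed[b] / totals[b] for b in BUCKETS if totals[b]}


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Buckets where the candidate scores worse than baseline beyond tolerance."""
    return [b for b in baseline if candidate.get(b, 0.0) < baseline[b] - tolerance]


cases = [
    Case("happy_path", "2+2", lambda out: out == "4"),
    Case("unrecoverable", "diagnose me", lambda out: "escalate" in out),
]


def old_system(prompt: str) -> str:
    return "4" if prompt == "2+2" else "escalate to a clinician"


def new_system(prompt: str) -> str:
    return "4" if prompt == "2+2" else "probably fine"


baseline = run_harness(old_system, cases)
candidate = run_harness(new_system, cases)
flagged = regressions(baseline, candidate)
```

With this in place, the model-swap argument collapses into reading `flagged`: the new system holds the happy path and regresses on the unrecoverable bucket.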

The rubric is the part that gets smarter. The cases are the part that gets bigger. Both get versioned. When the system is wrong, I do not edit the prompt first. I edit the rubric, run the harness, see what else moves. Sometimes the right answer is that my taste was wrong, not the model's behavior. The rubric is where that conversation lives.

One more thing about scoring that took me a year to internalize: a single judge model drifts. The same rubric run by the same judge will produce subtly different scores six months apart, because the judge gets updated, the calibration shifts, and the absolute numbers stop being comparable across time. The fix at small scale is to recalibrate against a small human-labeled set every quarter. The fix at large scale is to use a panel of judges instead of one, average their scores, and accept that you are buying statistical reliability at the cost of inference compute. Both approaches are right at different points in a system's life.
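Both fixes fit in a few lines. In this sketch a judge is any callable from output text to a score between 0 and 1; the lambda judges are placeholders for real judge-model calls, and treating drift as a single additive offset is a simplifying assumption.

```python
from statistics import mean
from typing import Callable

Judge = Callable[[str], float]


def panel_score(output: str, judges: list[Judge]) -> float:
    """Average the scores of several judges for one output."""
    return mean(judge(output) for judge in judges)


def calibration_offset(judge: Judge, labeled: list[tuple[str, float]]) -> float:
    """Mean of (human score - judge score) over a small human-labeled set."""
    return mean(human - judge(text) for text, human in labeled)


# Placeholder judges; in practice each would call a different judge model.
strict, middling, lenient = (lambda _: 0.6), (lambda _: 0.8), (lambda _: 1.0)
score = panel_score("some model output", [strict, middling, lenient])

# Quarterly check: how far has one judge moved from the human labels?
offset = calibration_offset(strict, [("note A", 0.7), ("note B", 0.9)])
```

A positive `offset` says the judge has become harsher than your human labelers. Tracking that number over quarters is what keeps old and new scores comparable.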

The compounding kicks in here. Loop 4 is what tells Loops 1, 2, and 3 whether they are working. Without it, every iteration is a guess.

How I Evaluate Teams Now

This framing is not just for self-assessment. In the last six months I have started using the four loops as a diagnostic when I look at how a team is using AI.

Three categories show up consistently.

Teams that have externalized all four loops. These are rare. Their refinement loop has compaction strategy and steering behavior they can describe in a sentence. Their workflow loop has a written contract, a routing manifest, and an owner. Their retry loop has a defended position on N, a verifier they trust, and a budget. Their learning loop produces a number that everyone agrees represents quality. The output of these teams is two to four times what their headcount predicts, and the gap is widening. They compound. They are dangerous.

Teams that have externalized one or two loops well. This is most of the field. They usually have a strong workflow loop and a weak learning loop, or strong evals and no real refinement infrastructure. They are doing real work and shipping real things, but they keep relearning lessons their tools should be remembering for them, or they keep shipping changes whose effect they cannot measure. They are not bad teams. They are running an incomplete system that nobody has shown them how to complete.

Teams that have externalized zero loops but talk a lot about agents. These teams have adopted the vocabulary and not the architecture. Their "agents" are clever prompts wrapped in for-loops. Their "context engineering" is a longer system prompt. Their "evals" are a Slack channel where people complain about bad outputs. They are doing 2024 work with 2026 names on it, and the longer they spend in that state the more painful the transition out becomes, because they have made organizational commitments to language that hasn't yet been backed by infrastructure.

The diagnostic move is straightforward. Ask which of the four loops is externalized, who owns it, and where it lives. The answers tell you almost everything you need to know about whether the team is going to compound this year or run in place.

What This Actually Changes

I do not write prompts the way I used to. I do not even start there.

When I sit down to build something now, I run through four questions before I touch a keyboard:

Where does this work live? Fresh thread or durable one. If durable, what's the compaction strategy. Who can steer it mid-flight.

What does the next session need that this one is producing? Where does that context get written down. Who else needs to read it. How will the model find it without being told.

Does this task get one shot or many? If many, what's the verifier. If one, why am I confident enough for that.

How will I know it worked? Precisely enough that a different person or a different model could check it. Where is the rubric. When the rubric is wrong, how does it get updated.

The questions are not new. The honest answer is that I always had to answer them, I just answered them in my head, which meant I answered them inconsistently and could not pass the answers to anyone else. Putting them into infrastructure has made my output more legible to my team and, more uncomfortably, more legible to me.

The Mistake That Keeps Repeating

The most common version of this mistake is treating the four loops as a menu you pick one item from. Teams stand up retrieval and call it done. Teams build an eval suite and call it done. Teams sign up for a long-running thread tool and call it done.

The compounding does not start until all four are running and visible. A team with great retrieval and no eval harness is optimizing a workflow they cannot measure. A team with brilliant evals and no durable threads relearns the same lessons every session. A team running parallel sampling at inference time with no learning loop is just spending more money to make the same mistakes faster.

There is also a more subtle version of this mistake. Teams externalize all four loops but do not wire them to each other. The eval harness produces scores, but the scores never make their way back into the workflow contract or the retry policy. The durable threads accumulate context, but the context never gets distilled into a rubric. The pieces are all present and nothing compounds, because the loops are running in parallel rather than feeding each other.
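One small example of wiring two loops together: let the learning loop's pass rates set the retry loop's N. The thresholds below are assumptions I picked to illustrate the shape, not recommendations, and the bucket names are placeholders for however your harness groups its cases.

```python
def choose_n(pass_rate: float, has_verifier: bool) -> int:
    """Pick a sample count from an eval pass rate (thresholds are assumed)."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if not has_verifier:
        return 1  # without a verifier, extra samples cannot be ranked
    if pass_rate >= 0.95:
        return 1  # already reliable: one shot
    if pass_rate >= 0.80:
        return 3
    return 8


def retry_policy(bucket_scores: dict[str, float], verifiable: set[str]) -> dict[str, int]:
    """Turn per-bucket eval scores into a per-bucket N."""
    return {
        bucket: choose_n(score, bucket in verifiable)
        for bucket, score in bucket_scores.items()
    }


policy = retry_policy(
    {"happy_path": 0.98, "recoverable": 0.85, "adversarial": 0.60, "unrecoverable": 0.50},
    verifiable={"happy_path", "recoverable", "adversarial"},
)
```

The function is trivial. The wiring is the point: the retry policy is now derived from a score instead of from whoever last had an opinion about N.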

The shift is not "add a loop." It is "make every loop legible, and then wire them together." That is a different kind of work, and it is the work that distinguishes teams who get genuine leverage from teams who keep declaring the latest move the answer.

The Fifth Loop, Coming Soon

There is one more loop on the horizon that I do not yet trust myself to externalize, but I want to name it because I think the next year is about closing it.

The four loops I have described (refinement, workflow, retry, learning) make a system that improves how you use the model. The fifth loop improves the model itself, continuously, from the signal the learning loop produces. Today most teams treat the model as a fixed input. They tune everything around it. The fifth loop says: the eval signal you are already generating is exactly the training signal a preference-optimization pass needs. The score the rubric gave you yesterday becomes the reward signal the model fine-tunes on tomorrow.

I have seen pieces of this in production. I have not yet seen it done well enough that I would recommend it to a team that hasn't already mastered the other four. The risk is that you start optimizing for the rubric instead of the underlying quality, and because the rubric is also your only measurement instrument, you never notice. But this is where the loops are heading. The thing that used to be "we fine-tune every six months on a curated dataset" is going to become "the eval harness feeds a continuous improvement loop on production traffic."

When that loop closes, the four become five, and the unit of work shifts again. I am not there yet. Most teams are not there yet. The teams that get the first four right this year are the ones who will be there next year.

Prompts didn't die. They got unrolled into the four loops that were always running underneath them. The skill that's emerging is not prompt craft. It is loop design — choosing which loops to externalize, what their stopping conditions are, what their feedback signals look like, and how they wire into each other.

The teams that win this year are not the ones with better prompts. They are the ones whose loops you can read.