Context Engineering: The Skill That Replaced Prompt Engineering

Figure: sketch of a layered context architecture feeding into an agent core. Context engineering means designing the information space, not just the prompt.

About six months into building MetaCaddie — our golf AI that processes thousands of course records, shot patterns, and caddie recommendations — I realized every conversation with our agent was starting with the same problem. The model was smart enough. The tools were wired up. But the agent kept making bad routing decisions and surfacing irrelevant context.

I kept assuming the fix was a better prompt. I rewrote the system prompt four or five times. Added examples. Tightened the tone. Tweaked the temperature.

None of it moved the needle.

What moved the needle was redesigning what information the agent could reach, and how that information was structured when it arrived.

That was the moment I understood: prompting and context engineering are not the same discipline.

The Distinction Nobody Makes Clearly

Prompting is about instructions. What to do, how to act, what format to use. Prompting says "you are a helpful assistant that..."

Context engineering is about the information space the agent operates inside. Every tool response. Every document structure. Every piece of metadata attached to a result. The shape of what the agent can see determines the quality of what it can reason about — far more than any instruction you write.

Here is the analogy that made it click for me: imagine asking a brilliant analyst to do research, but the only room you give them access to has filing cabinets stuffed with unlabeled papers in random order. Now imagine giving that same analyst a well-organized library — labeled shelves, indexed cards, a floor map, and a librarian who can tell them "those stacks cover 1990-2010, and there are about 400 relevant documents on your topic, mostly in the economics section." Same analyst. Completely different outcome.

Most engineers are building filing cabinets and wondering why their analysts are struggling.

Four Levels of Tool Response Quality

This is the framework I now use when I am building or reviewing any AI system. Every tool response your agent receives falls somewhere on this spectrum.

Level 1: Raw text, no structure.

The agent gets whatever the tool returns. Logs, API responses, database dumps — pipe it in and let the model sort it out. This is where most people start. It works well enough on toy problems and fails at production scale.

The problem is not the volume. It is the cognition cost. A hundred-thousand-line log file is trivially easy to pipe in and cognitively toxic: it crowds out the reasoning capacity you need.

Level 2: Structured responses with source metadata.

You add shape and provenance to every tool response. Instead of returning a blob of text, you return an object: here is the data, here is where it came from, here is when it was retrieved. The agent now knows not just what the data says but how current it is and how much it should be trusted.

This is the first meaningful win. It is a small change in how you write your tool wrappers and a large change in how the agent reasons about credibility and conflict resolution.
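A minimal sketch of what that wrapper change can look like, in Python. The field names and the golf example are illustrative, not a fixed schema:

```python
from datetime import datetime, timezone

def wrap_tool_response(data: str, source: str) -> dict:
    """Attach provenance to a raw tool result before it reaches the agent."""
    return {
        "data": data,                                            # what the tool returned
        "source": source,                                        # where it came from
        "retrieved_at": datetime.now(timezone.utc).isoformat(),  # when it was fetched
    }

# Instead of piping raw text into context, every wrapper returns this shape.
response = wrap_tool_response(
    data="Hole 14: prevailing wind SW 12 mph, firm greens",
    source="course_db/pebble-creek/conditions",
)
```

The exact fields matter less than the consistency: every tool the agent touches returns the same provenance-carrying shape.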

Level 3: Faceted responses with aggregate signals.

Along with the results, you include counts, category breakdowns, and distribution signals. If a search returns ten items, you also tell the agent: there were 847 total matches, 600 were from the last 30 days, and here is how they break down by category.

Now the agent can make intelligent decisions about whether it got what it needed. It can recognize when a result set is suspiciously narrow. It can decide to refine the query rather than blindly using the first page of results.

At MetaCaddie, this meant our agent could tell the difference between "I found three relevant course records" and "I found three records but there are 40 more I did not surface — let me try a different query."
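A sketch of what adding those aggregate signals might look like, reusing the numbers from the example above. The facet names are illustrative:

```python
def faceted_search_response(results: list[dict], total_matches: int, facets: dict) -> dict:
    """Return the page of results plus aggregate signals about the full match set."""
    return {
        "results": results,              # the items actually surfaced
        "returned_count": len(results),
        "total_matches": total_matches,  # how many matched in total
        "facets": facets,                # breakdowns the agent can reason over
    }

response = faceted_search_response(
    results=[{"id": "rec-101"}, {"id": "rec-102"}, {"id": "rec-103"}],
    total_matches=847,
    facets={
        "recency": {"last_30_days": 600, "older": 247},
        "category": {"course_records": 512, "caddie_notes": 201, "tournaments": 134},
    },
)
```

With returned_count and total_matches side by side, "I found three records" and "I found three of 43" become distinguishable states rather than the same observation.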

Level 4: Peripheral vision.

This is the hardest level to build and the most valuable in production systems. Your tool responses include structured hints about what exists beyond the results that were returned. Not just what you found — but a map of the terrain you did not fully explore.

Think of it as the difference between a search that returns ten blue links and a search that also tells you: there are results in three adjacent categories you did not search, there is a cluster of highly-cited documents from 2019 that match your query but were filtered by recency, and there is a common pattern among the top results that might indicate you are searching for the wrong thing.

The agent can now make strategic decisions about where to explore next. It is navigating, not just consuming.
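A sketch of what those navigation hints could look like when attached to a response. Every field name here is an assumption for illustration, not a standard:

```python
def peripheral_vision_response(results: list[dict], total_matches: int) -> dict:
    """Results plus a map of the terrain the query did not fully explore."""
    return {
        "results": results,
        "total_matches": total_matches,
        "peripheral": {
            # categories that matched but fell outside the searched scope
            "adjacent_categories": ["tournament_history", "weather_patterns"],
            # clusters excluded by the current filters, with the reason they were hidden
            "filtered_clusters": [
                {"label": "highly cited documents from 2019", "excluded_by": "recency_filter"}
            ],
            # a pattern in the top hits that may signal the query itself is off-target
            "result_pattern_hint": "top results cluster around par-3 holes; "
                                   "the query may be aimed at the wrong topic",
        },
    }
```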

Why This Changes How I Design Systems

The mental model shift I keep coming back to: agents explore information landscapes; they do not consume data chunks.

When I thought of RAG and retrieval as "find the relevant chunk and inject it," I was designing vending machines. The agent puts in a query, a chunk comes out, the agent uses it.

When I started thinking of agents as explorers of an information space, everything changed. The explorer needs a map. The explorer needs signals about what is nearby, what is worth investigating, and what can be safely skipped. The explorer needs to know when they have seen enough to make a decision and when they are missing something critical.

The success metric shifts accordingly. The question is no longer "did we find the right chunk?" It becomes "did we design an information space that the agent can navigate competently?"

These are very different problems. The first one is a retrieval problem. The second one is an architecture problem.

What This Looks Like in Practice

When I review an AI system now — mine or anyone else's — I look at tool responses first.

If every tool is returning raw strings, that is Level 1. The agent is probably making bad decisions not because the model is weak but because the information space is opaque.

If tool responses have metadata but no aggregate signals, you are at Level 2. You have picked the low-hanging fruit, but you are leaving a lot on the table.

Concrete things I changed in our golf AI system:

Subagent summaries before main context. When our agent needs to process shot history or course conditions, a small subagent runs first. It summarizes the raw data and passes structured output upstream. The main agent never sees the noise — it sees a clean, opinionated summary with confidence signals attached.
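A rough sketch of that flow. The summarization call is stubbed and the confidence rule is made up; the point is the shape handed upstream, not the model behind it:

```python
def summarize_shot_history(raw_rows: list[dict]) -> dict:
    """Subagent step: condense raw shot data into a structured summary for the main agent."""
    # A small model produces this text in practice; stubbed here for illustration.
    summary_text = "Summary of tendencies, misses, and club gaps would go here."
    return {
        "summary": summary_text,
        "rows_seen": len(raw_rows),                              # how much data backs the summary
        "confidence": "high" if len(raw_rows) >= 50 else "low",  # illustrative threshold
    }

# The main agent receives only this object, never the raw rows.
```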

Explicit boundary signals in search results. Every search response now includes a result_coverage field that tells the agent what percentage of the total matching records were returned, and what filters were applied. The agent uses this to decide whether to refine the query or proceed.

Category metadata on every document. Course records, tournament data, caddie notes — each one carries metadata tags that the agent can filter on in follow-up queries. Instead of re-searching from scratch, the agent can narrow within a known result space.

Failure mode signaling. When a tool call returns empty or low-confidence results, the response includes a search_alternatives field: here are adjacent queries that might help. The agent treats this as a navigation hint, not a dead end.
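Pulling the last three of those together, a sketch of the search response shape. The result_coverage and search_alternatives fields come from the description above; everything else is illustrative:

```python
def build_search_response(results: list[dict], total_matches: int, applied_filters: list[str]) -> dict:
    """Search response carrying coverage, category tags, and fallback navigation hints."""
    coverage = len(results) / total_matches if total_matches else 0.0
    return {
        "results": [
            {**r, "category": r.get("category", "uncategorized")}  # filterable tag on every record
            for r in results
        ],
        "result_coverage": {
            "returned_fraction": round(coverage, 2),  # share of matching records actually returned
            "applied_filters": applied_filters,       # what narrowed the set
        },
        # When nothing comes back, suggest adjacent queries instead of returning a dead end.
        "search_alternatives": [] if results else [
            "broaden the date range",
            "search caddie_notes instead of course_records",
        ],
    }
```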

None of these changes required a better model. None required prompt rewriting. They required thinking clearly about the information space the agent operates inside.

The Senior-Junior Divide

I have worked with a lot of engineers on AI systems over the last two years. The pattern I keep seeing: junior engineers spend their cycles on prompt iteration. They rewrite instructions, add few-shot examples, adjust formatting constraints. They treat the model as the variable to optimize.

Senior engineers — the ones building things that actually work in production — spend their cycles on information architecture. They treat the model as roughly fixed and ask: given this model's capabilities, what information design will let it perform at its ceiling?

Prompting is important. You cannot ignore it. But it is a local optimization. Context engineering is the global one.

The structural question — how do I design the information space my agent navigates? — is the one that separates systems that work at demo scale from systems that work in production.

The Metadata Is the Prompt

The framing I keep returning to: metadata is prompt engineering.

Every field you attach to a tool response is an instruction. Every aggregate signal you include is teaching the agent how to think about the data. Every boundary signal is teaching the agent how to navigate when results are incomplete.

You are not just returning data. You are teaching the agent to have a mental model of the information landscape — what kinds of things exist, how they are organized, where the edges are, and what a good navigation decision looks like.

The engineers who internalize this build agents that are qualitatively different from those built by engineers who do not. Not because they wrote better instructions in a system prompt. Because they built a better world for the agent to think inside.

That is the meta-skill. Everything else is downstream from it.