Table Stakes for Pragmatic Development Using LLMs

I've been using Claude Code since early access, and I first published this article in September 2025. The ecosystem has moved enough since then that the original needed a real update. Here's what's changed and what I got wrong the first time.

The short version: prompting is the least interesting part of this. Context engineering, eval discipline, and model economics are where the real gains hide.

1. Context Engineering ≠ Prompting

I spent months thinking about how to write better prompts. That was the wrong question.

The better question is: what information space are you giving your agent to explore, and how is that space structured?

Here's the distinction that finally clicked for me. Prompting is about the instruction. Context engineering is about designing everything the agent can reach — tool responses, file structures, metadata, subagent outputs — so that the agent can navigate toward the right answer rather than having to be told exactly what it is.

A concrete example: I used to dump raw error logs into my agent context. The agent would try to parse them. It worked poorly. The better approach is to run a subagent specifically to summarize those logs, then pass the summary upstream. Bad context is computationally cheap but cognitively toxic — 100k lines of logs costs nothing to include but destroys the quality of reasoning that follows.
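
Here's a minimal sketch of that pattern using the Anthropic Python SDK. The model ID and prompt wording are illustrative assumptions, not prescriptions; the point is that only the summary ever reaches the main agent's context:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def summarize_logs(raw_logs: str, max_chars: int = 200_000) -> str:
    """Run a cheap subagent over noisy logs and return only the summary.

    The main agent never sees the raw logs, just this distilled view.
    """
    response = client.messages.create(
        model="claude-haiku-4-5",  # placeholder ID: a cheap model is fine for triage
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Summarize these logs for a debugging agent. List distinct "
                "errors with counts, the first occurrence of each, and any "
                "stack traces that repeat:\n\n" + raw_logs[:max_chars]
            ),
        }],
    )
    return response.content[0].text
```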

The framework I now use has four levels of tool response quality:

Level 1 (where most people start): Raw text, no structure. The agent gets what the tool returns.

Level 2 (immediate win): Add source metadata and structure to every tool response. The agent now knows where information came from, not just what it says.

Level 3 (meaningful improvement): Faceted responses — include counts, categories, and aggregations alongside results. The agent can refine its next query based on the shape of the answer space.

Level 4 (worth investing in for production systems): Peripheral vision. Structured hints about what exists beyond the top-k results. The agent can make strategic choices about what to explore next rather than just consuming what was surfaced.

The mental model shift: agents explore information spaces, they don't consume data chunks. Your job is to make that space navigable, not just populated.

Practical upshot: Review every tool response your agent uses. If it's returning raw text, add metadata. If it's returning a list, add counts and categories. If you're piping noisy data (logs, diffs, large files) directly into the main context, route it through a subagent summary first.
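
As a sketch of what that looks like in practice, here's a wrapper that upgrades a raw hit list to a Level 2-3 response. The field names and hit schema are my own invention, not a standard:

```python
from collections import Counter

def wrap_search_response(query: str, hits: list[dict], k: int = 10) -> dict:
    """Upgrade a raw hit list to a faceted, metadata-rich tool response.

    Assumes each hit is a dict with 'path', 'snippet', and 'category'
    keys; the schema is illustrative.
    """
    facets = Counter(h["category"] for h in hits)
    shown = hits[:k]
    return {
        "query": query,
        "total_matches": len(hits),      # shape of the answer space, not just top-k
        "facets": dict(facets),          # counts per category: Level 3
        "results": [
            {"source": h["path"], "snippet": h["snippet"]}  # provenance: Level 2
            for h in shown
        ],
        # Level 4 peripheral vision: hint at what exists beyond the top-k
        "beyond_top_k": {
            "remaining": len(hits) - len(shown),
            "unseen_categories": sorted(
                set(facets) - {h["category"] for h in shown}
            ),
        },
    }
```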

Your PRD Foundation Still Matters

None of this changes the value of good context documents. CLAUDE.md is still your project's constitution — coding rules, architecture decisions, tech stack rationale, style guidelines. Supporting markdown files in a notes/ directory give agents everything they need to understand business context without burning cycles asking.

What's changed is that I'm more intentional about what goes into these files. Dense, comprehensive context files are good; indiscriminate dumps are not. Every line in your context costs reasoning capacity. Be selective.
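
For reference, a skeleton of the shape I mean. Every entry below is a made-up example from a hypothetical project; the point is short, opinionated sections rather than an exhaustive dump:

```markdown
# CLAUDE.md

## Tech stack
Next.js + TypeScript, Postgres via Prisma. Rationale lives in notes/adr/.

## Coding rules
No default exports. Prefer server actions over API routes unless streaming.

## Architecture decisions
Monorepo, single deployable. See notes/adr/0003-monorepo.md.

## Style
Errors are values at module boundaries; exceptions are for bugs only.
```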


2. Master Your Workflow Strategy

Plan Mode vs Accept Edits

Plan mode for architecture. Accept edits for implementation. I toggle between these constantly and the pattern still holds. What's changed is I use plan mode more aggressively — not just for complex features but for any task where I don't have high confidence in the approach. The cost of planning is low; the cost of a wrong direction is a full context burn.

Have an actual dialogue in plan mode. Tell Claude to ask you questions. The best implementations emerge from back-and-forth, not one-shot prompts. If your agent isn't pushing back on your requirements, something's wrong.

Pick Your Tech Stack Carefully

LLMs work best with popular frameworks and well-documented libraries. This is even more true in 2026 — the models have absorbed far more context about mainstream choices than obscure ones. If you're reaching for an unusual package, ask yourself whether the productivity gains justify the agent tax.

Human in the Middle

Feature branches for each session. Review everything before merging. Ask Claude to build debugging tools for complex logic. None of this changed.

What I've added: checkpoints. For any session longer than about 30 minutes of agent work, I pause and do a manual review before continuing. It's faster than debugging a wrong direction that compounded for an hour.


3. Model Economics: Pick the Right Tool

This is something I got completely wrong in the original article. I was using the same model for everything. That's expensive and slow.

The decisions worth making:

Task type | Model tier | Why
Architecture design, complex reasoning | Opus / most capable | This is where inference quality matters most
Code generation, standard features | Sonnet | Best cost-quality tradeoff for the majority of work
Quick lookups, formatting, simple transforms | Haiku | You don't need a reasoning model to rename a variable
Parallel agents doing independent tasks | Sonnet or Haiku | Multiply throughput, not quality
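
A routing layer for this doesn't need to be clever. Here's a minimal sketch; the model IDs are placeholder assumptions (substitute whatever your provider currently ships):

```python
# Minimal router keyed to the table above.
MODEL_TIERS = {
    "architecture": "claude-opus-4-5",    # most capable: design, hard reasoning
    "feature":      "claude-sonnet-4-5",  # default tier: code generation
    "mechanical":   "claude-haiku-4-5",   # lookups, renames, formatting
    "parallel":     "claude-haiku-4-5",   # throughput over peak quality
}

def pick_model(task_type: str) -> str:
    # Default to the cheap tier: misrouting a hard task shows up in review,
    # while paying top-tier prices for a rename is silent waste.
    return MODEL_TIERS.get(task_type, MODEL_TIERS["mechanical"])
```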

A related concept worth understanding: inference-time compute. Newer reasoning models (o-series, DeepSeek R1, and their successors) spend more compute at inference time — generating chains of reasoning before the final answer. For hard problems, this works dramatically better than just throwing a bigger model at it with a one-shot prompt.

The practical implication: for genuinely hard reasoning tasks (complex architecture decisions, debugging obscure behaviors, designing evaluation rubrics), enable extended thinking or use a model with built-in chain-of-thought. For most coding tasks, it's overkill. Matching the inference strategy to the task difficulty is one of the highest-ROI optimizations available right now.
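
With the Anthropic SDK, extended thinking is a single parameter. A sketch, with the model ID and token budget as starting-point assumptions rather than recommendations:

```python
import anthropic

client = anthropic.Anthropic()

# Extended thinking for a genuinely hard task (an architecture decision).
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16_000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": (
        "Compare event sourcing against CRUD plus an audit log for this "
        "billing service. Argue both sides before recommending one."
    )}],
)

# The response interleaves thinking blocks with the final answer.
answer = next(b.text for b in response.content if b.type == "text")
```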


4. Leverage Terminal Support Tools

The tooling ecosystem has consolidated since 2025. A few things that are still worth your attention:

tmux + git worktrees is the right pattern for parallel agents on the same codebase. Give each agent its own worktree on a separate branch. No collisions, easy review, clean merges.
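
A sketch of the worktree setup, scripted from Python so you can stamp one out per agent. The task and branch names are placeholders:

```python
import subprocess

def add_worktree(repo: str, branch: str, path: str) -> None:
    """Create an isolated worktree on a fresh branch for one agent.

    Equivalent to: git -C <repo> worktree add -b <branch> <path>
    """
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path],
        check=True,
    )

# One worktree per agent; each tmux pane runs its agent in its own copy.
for task in ["auth-refresh", "search-facets", "log-summaries"]:
    add_worktree(".", f"agent/{task}", f"../wt-{task}")
```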

YOLO mode (suppressing confirmations) works well when you keep agent tasks small and testable. The moment you start doing big, hard-to-reverse operations, turn it off.

awesome-claude-code on GitHub tracks the ecosystem. Worth checking quarterly — tools that were experimental a year ago are now stable, and the ones that sounded promising have mostly been abandoned.

The original article mentioned Claude Squad and Claude Composer. Both are less relevant now: Claude Code's native multi-agent capabilities have absorbed most of what made them useful.

5. Graduate from Prompts to Agents with Tools

The framing shift that matters: stop thinking about "prompts" and start thinking about agents with specific capabilities, failure modes, and feedback loops.

An agent has:

  • Tools — what it can do
  • Context — what it knows
  • Evaluation — how you know if it's working

Most people invest heavily in the first two and skip the third. The third is where the real quality comes from.
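
If it helps to see that framing as a data structure, here's an illustrative sketch. The names are mine; the only point is that evaluation is a first-class field:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """The three-part anatomy from the list above."""
    tools: dict[str, Callable[..., str]]   # what it can do
    context: str                           # what it knows
    evaluate: Callable[[str], bool]        # how you know if it's working
    failures: list[str] = field(default_factory=list)

    def check(self, output: str) -> bool:
        ok = self.evaluate(output)
        if not ok:
            self.failures.append(output)   # raw material for error analysis
        return ok
```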


6. The Eval Framework You Actually Need

This section was the biggest gap in the original article. I had a few bullets about "build automated testing." That's not enough.

Here's what I actually do now:

Step 1: Manual review after every significant change

Before anything else — 30 minutes, 20 to 50 outputs, just you and the results. Not dashboards. Not aggregate metrics. Read the actual outputs your agent is producing and write down what's wrong.

This sounds tedious. It's the most valuable 30 minutes in your development cycle. Every AI product I've seen that shipped with quality problems skipped this step.
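
A tiny helper makes the sampling painless. A sketch assuming one JSON object per line with an 'output' field; adapt it to your own trace format:

```python
import json
import random

def sample_for_review(trace_path: str, n: int = 30) -> list[dict]:
    """Pull a random sample of agent outputs for a manual read."""
    with open(trace_path) as f:
        traces = [json.loads(line) for line in f if line.strip()]
    return random.sample(traces, min(n, len(traces)))

for i, trace in enumerate(sample_for_review("traces.jsonl"), 1):
    print(f"--- {i} ---\n{trace['output']}\n")  # read it; write down what's wrong
```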

Step 2: One decision-maker

Appoint one domain expert as the "benevolent dictator" for quality decisions. This person's judgment is ground truth. Not a committee. Not a rubric. One expert who knows what good looks like and whose call ends debates.

For your own projects, this is you. For team projects, pick one person and protect their time to do this role seriously.

Step 3: Error analysis, not just pass/fail counts

When outputs are bad, understand why before building evaluators. The process:

  1. Gather 100+ interaction traces
  2. Read them and write open-ended notes on issues you see (open coding)
  3. Group notes into failure categories, count frequency (axial coding)
  4. Keep going until 20 consecutive traces reveal no new failure patterns

Only then do you build evaluators for the failure modes you actually found, ranked by frequency.
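
The axial-coding step is mostly just counting once the notes exist. A sketch with invented category labels (yours emerge from the notes themselves, not from a predefined taxonomy):

```python
from collections import Counter

# Open-coding notes, one per reviewed trace: (trace id, issue observed).
notes = [
    ("trace-014", "hallucinated file path"),
    ("trace-019", "ignored lint rule in CLAUDE.md"),
    ("trace-021", "hallucinated file path"),
    ("trace-030", "stopped before tests passed"),
    # ... 100+ traces in practice
]

# Axial coding: collapse notes into categories and rank by frequency.
for category, count in Counter(c for _, c in notes).most_common():
    print(f"{count:>3}  {category}")  # build evaluators from the top of this list
```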

On binary vs. scored evaluations: Binary pass/fail beats Likert scales. Forced binary judgment is clearer, faster to label, and easier to reason about. 1-5 ratings introduce threshold ambiguity and require larger samples to detect meaningful differences.
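
A binary judge can be a single forced-choice call. A sketch with an illustrative model ID and prompt; one criterion per call keeps each label unambiguous:

```python
import anthropic

client = anthropic.Anthropic()

def binary_judge(output: str, criterion: str) -> bool:
    """Force a pass/fail call on a single criterion. No 1-5 scale."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"Criterion: {criterion}\n\nOutput:\n{output}\n\n"
                "Does the output satisfy the criterion? Answer PASS or FAIL only."
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("PASS")
```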

What a healthy eval suite looks like: If you're passing 100% of your evals, you're not testing hard enough. A pass rate around 70% is a signal that your evaluators are catching real problems, which is the point.

Entity resolution is still hard

LLMs struggle to track entities across large codebases. Be explicit: full paths, consistent naming, clear references. This hasn't changed.

7. Feedback Tools

Browser testing always. If your project has a UI, test in the browser constantly. Agents can't see what you see.

Figma MCP for design-to-code workflows. Copy designs you're working with to your personal workspace to ensure MCP access works without permission issues.

Docusaurus for PRDs. Both humans and agents can read and contribute to the same structured documents. Still the best option I've found for keeping context documents alive.


The key insight that's still true: pragmatic LLM development isn't about replacing human judgment; it's about amplifying it. You're the architect, product owner, and final reviewer. What's changed is the scope of what needs amplifying — it's not just the coding anymore, it's context design, model selection, and evaluation discipline.

Get those right and the coding part is almost easy.