Why Your AI Gets Smarter When You Let It Think Longer

Why making the model think longer actually works
Last year I wasted about three months trying to solve a clinical documentation problem by chasing better models. Every time outputs were bad, my instinct was: bigger model, more parameters, smarter weights. I burned through a meaningful chunk of API budget swapping between models, convinced the answer was somewhere in a higher benchmark number.
It was not.
The problem was that I was treating inference like a light switch — on or off, answer or no answer. What I eventually figured out is that inference is more like a dial. And most practitioners, including me for most of 2025, had it turned to the minimum.
That dial is test-time compute. And understanding it changed how I build AI products.
The Idea, Stripped Down
Model training is fixed cost. You do it once and ship. What happens at inference — how much computation the model gets to do before returning an answer — has historically been an afterthought. You send a prompt, the model generates tokens left to right, done.
The research Weng synthesizes shows this is leaving enormous capability on the table. The core insight: giving a model more computation at inference time, through chain-of-thought reasoning, multiple attempts, or iterative self-correction, can produce results that beat much larger models that are not given that space.
In practical terms: a 70B model allowed to reason explicitly across 2,000 tokens of scratchpad can outperform a 200B model answering in 150 tokens. The compute spent is different. The cost profile is different. But the outcome quality can be dramatically higher — and for many tasks, cheaper per correct answer.
This is not an academic footnote. It is the central bet behind o1, o3, DeepSeek R1, and every reasoning model that has shipped in the last 18 months. When you enable extended thinking in Claude or query an o-series model, you are turning that dial up.
The question for practitioners is: when should you, and when should you not?
The Four Techniques Worth Knowing
Not all inference-time compute is the same. Here are the four mechanisms that matter in production:
1. Chain-of-Thought
The simplest and most widely deployed. Instead of asking the model to jump directly to an answer, you prompt it to reason through intermediate steps first. "Think step by step" is the casual version. More structured CoT involves explicit scaffolding — decompose the problem, list assumptions, reason toward a conclusion, then answer.
✅ Works for: logic-heavy tasks, multi-step calculations, structured analysis, code review where the model needs to trace execution
❌ Does not help for: simple lookups, factual recall, classification with obvious signal, anything where the answer is fast and unambiguous
The failure mode is cargo-culting CoT onto tasks that do not need it. A model asked to think step by step about which of two numbers is larger is wasting tokens. Worse, verbose reasoning on simple tasks can introduce errors the model would not have made in a direct response.
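Here is roughly what the structured version looks like in practice. This is a minimal sketch, not a standard: the scaffold wording and the "ANSWER:" convention are my own placeholders, and you would wrap whatever chat API you already use around it.

```python
# A minimal structured-CoT scaffold. The numbered steps and the "ANSWER:" marker
# are illustrative conventions, not anything standardized.
def build_cot_prompt(task: str) -> str:
    return (
        f"{task}\n\n"
        "Before answering:\n"
        "1. Restate the problem in one sentence.\n"
        "2. List the assumptions you are making.\n"
        "3. Work through the intermediate steps, showing your reasoning.\n"
        "4. Only then give the final answer on a line starting with 'ANSWER:'.\n"
    )

def extract_answer(completion: str) -> str:
    # Keep the scratchpad around for auditing, but only surface the final line downstream.
    for line in reversed(completion.splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return completion.strip()  # fall back to the raw completion if the format slipped
```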
2. Best-of-N Sampling
Generate N candidate answers, then pick the best one. At the simplest level, "best" means highest average token probability — the model's own confidence signal. At a more sophisticated level, you use a separate reward model to score candidates.
This technique is wildly underused. A model that gets 40% of hard math problems right on a single attempt can get over 85% right when you sample 64 times and select the best-scoring answer. The underlying capability was there. You just needed to ask more than once.
The cost is linear with N — you pay for 64 completions to get 1 answer. So the question is whether the quality gain justifies the cost for your use case. For healthcare documentation at OpenLoop, the answer was often yes. A wrong medical summary is far more expensive than the API cost of generating eight candidates and picking the one a verifier scores highest.
✅ Works for: high-stakes decisions where errors are costly, tasks with a clear verifiable correct answer, batch jobs where latency is not a constraint
❌ Avoid when: real-time UX requires sub-second latency, token cost is the binding constraint, the task has no good verification signal
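A minimal sketch of the confidence-scored version, assuming an OpenAI-compatible chat API via the openai Python SDK. The model name, N, and temperature below are placeholders you would tune, and "best" here is just mean token log-probability; for higher stakes you would swap in a reward model or rubric-scoring call.

```python
# Best-of-N sketch: sample N candidates, rank by mean token log-probability.
# Assumes an OpenAI-compatible chat API; model name, n, and temperature are placeholders.
from openai import OpenAI

client = OpenAI()

def best_of_n(prompt: str, n: int = 8, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,                 # n candidates in one request
        logprobs=True,       # return per-token log-probabilities
        temperature=0.8,     # enough diversity that the candidates actually differ
    )

    def mean_logprob(choice) -> float:
        tokens = choice.logprobs.content or []
        if not tokens:
            return float("-inf")
        return sum(t.logprob for t in tokens) / len(tokens)

    # "Best" here is the model's own confidence signal; swap in a reward model
    # or rubric-scoring call for higher-stakes use.
    best = max(response.choices, key=mean_logprob)
    return best.message.content
```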
3. Process Reward Models
Standard reward models evaluate the final answer. Process Reward Models (PRMs) evaluate intermediate reasoning steps.
This is the step-level version of best-of-N. Instead of generating N complete solutions and picking the best at the end, you use a trained verifier to score each step of reasoning and prune paths that are going wrong before they compound. The practical effect: you get more correct final answers without having to run the full reasoning chain to completion on every path.
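A toy sketch of what step-level pruning looks like, and nothing more. Here `expand` and `score_step` are stand-ins: in the real thing, `expand` is the generator proposing candidate next reasoning steps and `score_step` is the trained process reward model most teams do not have.

```python
# Illustrative step-level pruning (the idea behind PRM-guided search), not a real PRM.
# expand proposes candidate next steps; score_step stands in for a trained verifier.
from typing import Callable

def prm_guided_search(
    expand: Callable[[list[str]], list[str]],       # partial chain -> candidate next steps
    score_step: Callable[[list[str], str], float],  # scores one candidate step in context
    beam_width: int = 4,
    max_steps: int = 10,
) -> list[str]:
    beams: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(max_steps):
        candidates = []
        for total, chain in beams:
            for step in expand(chain):
                candidates.append((total + score_step(chain, step), chain + [step]))
        if not candidates:
            break
        # Prune: keep only the highest-scoring partial chains before they compound errors.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]
```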
The hard part: PRMs require training data that labels intermediate steps as correct or incorrect. That is expensive to produce. For most product teams, PRMs are not something you build — they are something you use through models that have already been trained with this feedback (like the o-series).
But knowing this is what happens under the hood shapes how you prompt. Reasoning models respond differently to instruction style than standard generation models. They do better with explicit problem decomposition and worse with overly rigid output formatting requirements that fight against their native reasoning flow.
4. Sequential Self-Revision
The model generates an answer, receives feedback, then revises. The feedback can come from an external tool (a code interpreter that ran the code and returned an error), another model acting as a critic, or a structured prompt that asks the model to review its own work.
This is the mechanism behind agentic loops — the pattern where Claude Code runs your tests, reads the failure, and tries again. It is also the hardest to get right in practice because revision loops can cycle indefinitely if the feedback signal is wrong, noisy, or pointing at the wrong thing.
My rule: revision loops need a hard exit condition and a verifiable feedback signal. "The model said it looks good" is not a verifiable feedback signal. "The test suite passed" is. "The SQL query returned a valid result set with the expected schema" is. "A secondary model scored the output above 0.85 on a calibrated rubric" is, if you have done the work to calibrate it.
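The shape of that rule in code, as a sketch rather than a recipe. Here `run_tests` and `revise` are placeholders for whatever tool and model call you actually have; the only load-bearing parts are the hard attempt cap and the tool-grounded feedback.

```python
# A minimal revision loop with a hard exit and a tool-grounded feedback signal.
# run_tests and revise are placeholders: in practice run_tests shells out to a
# real test suite and revise is a model call that receives the traceback.
from typing import Callable

def revise_until_passing(
    draft: str,
    run_tests: Callable[[str], tuple[bool, str]],  # returns (passed, traceback_or_empty)
    revise: Callable[[str, str], str],             # (previous_draft, feedback) -> new draft
    max_attempts: int = 3,                         # the hard exit condition
) -> tuple[str, bool]:
    current = draft
    for _ in range(max_attempts):
        passed, feedback = run_tests(current)
        if passed:
            return current, True
        current = revise(current, feedback)
    # Out of attempts: return the last draft and flag it for human review
    # rather than looping forever on a noisy signal.
    return current, False
```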
When Test-Time Compute Beats a Bigger Model
Here is the frame I actually use when making this decision:
| Situation | Reach for test-time compute | Reach for a bigger model |
|---|---|---|
| Task is hard and has a verifiable answer | ✅ | |
| Latency budget is tight (<500ms) | | ✅ |
| Task is reasoning-heavy (math, code, logic) | ✅ | |
| Task is knowledge-heavy (recall, retrieval) | | ✅ |
| You can afford linear cost per correct answer | ✅ | |
| You are doing real-time conversational UX | | ✅ |
| Errors are expensive (medical, legal, financial) | ✅ | |
| Task is pattern-matching at scale | | ✅ |
The key insight Weng's analysis confirms empirically: for reasoning tasks specifically, small-model-plus-compute consistently hits Pareto-optimal territory versus large-model-no-compute. You can get equivalent or better results at lower cost per correct answer if you match the inference strategy to the problem type.
This only holds for the class of problems where more thinking actually helps. For problems where the model either knows the answer or does not — factual recall, language translation, straightforward summarization — extended compute buys you very little.
What I Actually Do Differently Now
Three concrete changes in how I build:
1. I decide inference strategy before I decide model tier.
The old question was: "Which model should I use?" Now the first question is: "Does this task benefit from extended reasoning?" If yes, I use a reasoning model or enable extended thinking, and I often end up on a smaller base model than I would have chosen before. If no, I use the fastest model that clears the quality bar.
2. I use Best-of-N for high-stakes outputs, always.
For anything in the clinical documentation stack — summaries that go into a patient record, structured data extractions that feed downstream decisions — I generate three to five candidates and run a lightweight verifier. The verifier is usually another model call with a tight scoring rubric, not a full PRM. This has caught a category of errors that prompt engineering alone never fully solved; a sketch of the verifier pattern follows this list.
3. I treat agentic revision loops with more respect.
The instinct when a loop fails is to make the loop smarter. The right instinct is usually to make the feedback signal sharper. A revision loop with a precise, tool-grounded feedback signal (code ran, test failed, here is the traceback) will outperform a clever multi-step critic with a fuzzy quality signal every time. Get the signal right before you get the loop fancy.
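For what it is worth, the verifier from point 2 is not much code. A sketch assuming the openai Python SDK; the rubric wording, model name, and JSON shape are illustrative, and a real rubric needs domain-specific calibration before you trust its scores.

```python
# Lightweight verifier sketch: another model call with a tight rubric, not a trained PRM.
# The rubric, model name, and score scale are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the candidate clinical summary from 0 to 10 against the source note.
Deduct heavily for any statement not supported by the source. Respond as JSON:
{"score": <number>, "reason": "<one sentence>"}"""

def score_candidate(source_note: str, candidate: str, model: str = "gpt-4o-mini") -> float:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"SOURCE NOTE:\n{source_note}\n\nCANDIDATE:\n{candidate}"},
        ],
        response_format={"type": "json_object"},  # keep the verifier output parseable
        temperature=0,
    )
    return float(json.loads(response.choices[0].message.content)["score"])

def pick_best(source_note: str, candidates: list[str]) -> str:
    # Generate three to five candidates upstream, then keep the top scorer.
    return max(candidates, key=lambda c: score_candidate(source_note, c))
```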
The Mistake I Still See
Teams try to solve reasoning problems with retrieval. They add more context, better chunking, more sophisticated RAG pipelines — and they are genuinely surprised when the model still makes the wrong call on hard multi-step problems.
Retrieval solves the "model doesn't know this" problem. Test-time compute solves the "model doesn't think hard enough about this" problem. These are different problems with different solutions. Conflating them burned me, and I see it burning teams constantly.
If your model has access to all the relevant information and is still getting the answer wrong, adding more retrieval is probably not the fix. Giving it more space to reason probably is.
The practical summary: inference-time compute is the most accessible high-leverage lever most AI product teams are not using. The technology is deployed and available today — reasoning models, extended thinking toggles, the ability to generate multiple candidates and verify them. You do not need to train anything new.
What you need is a clearer mental model of when thinking longer actually helps versus when it just costs more. Build that intuition and you will make better architecture decisions, spend less on API calls, and ship products that are meaningfully more reliable on the tasks that matter.
The dial exists. Learn when to turn it up.
