The LLM Year in Review: What Actually Mattered in 2025 (And What Was Noise)

Sketch of LLM landscape as topographic map with reasoning breakthrough radiating out

What actually mattered — not the benchmarks, the underlying shift

The most surprising thing about 2025 was not that a Chinese lab beat the frontier. It was that they did it by training a model to reason — and that reasoning, generated through reinforcement learning on its own outputs, turned out to be worth more than adding a few hundred billion parameters.

That reframe matters. The dominant narrative going into 2025 was that scale was the story: whoever could build and train the biggest model on the most data with the most chips would win. That story was not wrong, but it was incomplete in a way that cost a lot of people time and money and credibility. DeepSeek R1 dropped in January and made the incompleteness impossible to ignore.

I have been building with these models for long enough that I have some calluses. I do not startle easily. R1 startled me.

What We Predicted

The going-in assumptions at the start of 2025 were reasonable given what the evidence supported. More parameters meant better models — consistently, across domains, across tasks. The scaling laws from OpenAI's early work held up well enough that compute and data volume were the obvious levers to pull. The frontier was defined by the labs that could afford the largest runs.

Post-training mattered — RLHF and RLAIF had clearly improved model behavior past what raw pretraining produced — but the working assumption was that it was table stakes, not a differentiator. You did the post-training work to make the model usable. The capability ceiling was still set by the pretraining run.

RAG was widely considered the answer to long-context and knowledge freshness problems. The playbook was: train a large base model, RLHF it into alignment, build retrieval infrastructure around it, and ship. That was the architecture you defended to stakeholders in early 2025.

The corollary: if your model was not performing well enough, the answer was a bigger model. That was the instinct embedded in most product teams I talked to. I had it too.

What Actually Happened

DeepSeek R1 changed the question from "how do we get more compute into training" to "how do we get more compute into inference."

The technique at the core of R1 was not novel in isolation — reinforcement learning for reasoning had been explored before. But R1 applied it at scale and in a way that produced something practically different: a model that, when asked a hard question, would generate an extended chain of reasoning, correct itself mid-thought, backtrack, and arrive at answers that beat models with significantly more parameters. It did this because it had been trained with RL that rewarded correct final answers — and the reasoning tokens were the mechanism the model learned to use to get there.

The implication that landed hardest for me: the reasoning was not scaffolded in from outside. The model learned to reason by being incentivized to produce correct outputs. RL was generating the intelligence, not just aligning it.
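To make that concrete, here is a minimal sketch of the outcome-reward idea, in the spirit of the group-relative method (GRPO) DeepSeek described. The sample_completions and extract_final_answer helpers are hypothetical stand-ins, and real recipes layer KL penalties, format rewards, and a full policy-gradient update on top of this:

```python
import statistics

def grpo_style_advantages(prompt, reference_answer, sample_completions,
                          extract_final_answer, group_size=8):
    """Score a group of sampled reasoning traces by final-answer
    correctness only, then normalize rewards within the group.

    The model is never told how to reason; it is only rewarded when
    the answer at the end of the trace is correct. Whatever
    chain-of-thought raises that reward is what gets reinforced.
    """
    completions = sample_completions(prompt, n=group_size)

    # Outcome reward: 1.0 for a correct final answer, 0.0 otherwise.
    rewards = [
        1.0 if extract_final_answer(c) == reference_answer else 0.0
        for c in completions
    ]

    # Group-relative advantage: traces that beat the group mean get a
    # positive learning signal, the rest get a negative one.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    advantages = [(r - mean) / std for r in rewards]
    return list(zip(completions, advantages))
```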

Long context ate more of the RAG use case than anyone expected. Context windows expanded past 128K tokens and then past 1M, and it turned out that for a meaningful class of retrieval problems, simply shoving the relevant documents into the context window worked as well as or better than a tuned retrieval pipeline. The RAG playbook did not die — it is still alive and appropriate for many situations — but the boundary shifted substantially, and a lot of expensive retrieval infrastructure turned out to be compensating for context limitations that no longer existed.

The other thing that happened was a genuine commoditization pressure on base models. When R1 demonstrated that post-training could produce reasoning behavior that offset raw parameter count, the competitive advantage of having the largest pretraining run narrowed. Post-training pipelines — the recipes for RLHF, DPO, and RL-based reasoning generation — became the actual differentiator.

The Three Things That Genuinely Mattered

Not all of 2025 was signal. There was a lot of noise — a lot of model launches that did not change what I actually built, a lot of papers that did not hold up, a lot of benchmark wins that meant nothing in production. Here is what actually mattered:

1. Reinforcement learning for reasoning generation is a genuine capability unlock.

This is the DeepSeek R1 lesson, and it is bigger than any single model release. The insight is that RL, applied to the model's own reasoning traces and rewarded for correct final answers, can produce reasoning behavior that scales with inference-time compute rather than training-time compute. You give the model more tokens to think, it produces better answers. The reasoning is not a wrapper — it is the mechanism.

What this means in practice: the reasoning-capable models that emerged in 2025 responded differently to prompting, performed better on multi-step problems, and degraded more gracefully under distribution shift than their non-reasoning counterparts. When I switched clinical documentation tasks from a standard generation model to a reasoning model and gave it space to think through ambiguous cases, accuracy on edge cases went up meaningfully without any prompt engineering changes.

The failure mode: reasoning models are slower and more expensive per call when you engage the thinking mechanism. Teams that applied them uniformly — enabling extended thinking on straightforward lookups and simple extractions — burned budget without getting benefit. The model is not always smarter when it thinks longer. It is smarter on the problems where more thinking actually changes the answer.
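One way to avoid that failure mode is to make the thinking budget a per-request routing decision rather than a global default. A sketch of the idea, with a hypothetical call_model(prompt, thinking_budget=...) standing in for whatever provider interface you use, and a deliberately crude keyword router where a real system would use a cheap classifier:

```python
# Hypothetical helper: assume call_model(prompt, thinking_budget=...)
# hits your provider with an inference-time reasoning budget in tokens,
# where 0 disables extended thinking. Provider APIs differ; this
# sketches the routing idea, not any specific vendor's interface.

SIMPLE_HINTS = ("look up", "extract", "list the", "what is the")

def answer(prompt: str, call_model) -> str:
    # Crude keyword router: straightforward lookups and extractions
    # get no thinking budget; anything that smells multi-step gets one.
    # A production router would be a cheap classifier, not keywords.
    is_simple = any(hint in prompt.lower() for hint in SIMPLE_HINTS)
    budget = 0 if is_simple else 8_000
    return call_model(prompt, thinking_budget=budget)
```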

2. Inference-time compute is a real optimization variable, not a niche research concern.

This is connected to reasoning, but broader. The underlying insight from 2025 is that compute spent at inference time and compute spent at training time are partially substitutable. Not fully — you cannot inference your way to capabilities the model genuinely does not have — but for reasoning-intensive tasks, a smaller model spending more compute at inference time consistently landed on the cost-quality Pareto frontier against larger models given no extra inference budget.
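The cleanest illustration of that substitutability is self-consistency-style sampling: spend more inference compute by drawing several independent reasoning traces and keeping the majority final answer. A minimal sketch, with hypothetical generate and extract_final_answer helpers:

```python
from collections import Counter

def majority_vote(prompt, generate, extract_final_answer, n=16):
    """Spend inference-time compute by sampling n independent
    reasoning traces (generate must sample with nonzero temperature)
    and keeping the most common final answer."""
    answers = [extract_final_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```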

I built a table for myself early in 2025 that I still use: if the task is reasoning-heavy and verifiable, reach for inference-time compute first. If it is knowledge-heavy or pattern-matching at scale, reach for a stronger base model. Getting this wrong in either direction is expensive.

The business implication is the one I keep coming back to: if intelligence is partly a function of inference-time compute rather than purely training-time compute, then the cost of intelligence becomes tunable in ways it was not before. You can pay for more intelligence on the calls where you need it and less on the calls where you do not. That is a meaningfully different cost model than "which model tier should we default to for everything."
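As a toy version of that cost model: if your provider bills thinking tokens at the output rate — a common scheme, but an assumption you should verify against your own provider — then the thinking budget is literally a per-call price dial. All names and rates here are illustrative:

```python
def call_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
              usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Per-call cost assuming thinking tokens bill at the output rate
    (common but not universal -- check your provider's pricing).
    The thinking budget becomes a per-call dial on the price of
    intelligence rather than a fixed property of the model tier."""
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * usd_per_mtok_in
            + billed_output * usd_per_mtok_out) / 1_000_000
```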

3. The post-training pipeline is now the competitive differentiator.

Labs with smaller pretraining budgets shipped models that competed credibly with frontier models because they invested heavily in the post-training stack: RL-based reasoning, preference optimization, alignment techniques that shaped model behavior in ways raw pretraining could not. The ceiling set by pretraining is still real, but the distance between the pretraining ceiling and what ships has become the competitive arena.

For teams building products — not training frontier models — this has a direct implication. Fine-tuning and post-training techniques became meaningfully more accessible in 2025. DPO runs stably on modest infrastructure. RL-based reasoning training is still expensive to do from scratch, but the open-weight reasoning models it produced can be fine-tuned directly. The recipe is more legible than it was in 2024, and the tools are better.
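For orientation, the core of DPO is small enough to read in one sitting. A minimal sketch of the loss from the original paper (Rafailov et al., 2023), assuming you have already computed summed per-sequence log-probs for each (chosen, rejected) pair under both the policy and a frozen reference model; beta = 0.1 is a common starting point, not a recommendation:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed log-probabilities over a batch
    of (chosen, rejected) response pairs. The loss pushes the policy to
    widen its chosen-vs-rejected margin relative to the reference model,
    with beta controlling how far the policy may drift from it.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```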

What This Means for 2026

Three things I am carrying forward:

Inference-time budget is a first-class design decision. Before choosing a model, I decide whether the task warrants extended reasoning. That question shapes everything downstream — model selection, latency budget, cost modeling, evaluation setup. It is not an afterthought.

Post-training investment is worth it for specialized domains. If you have domain-specific preference data — clinical corrections, legal review feedback, code review patterns — the techniques to use it are mature and the ROI is high. The argument that fine-tuning is too hard or too expensive has been weakening for two years; in 2025 it mostly stopped being true for teams with any meaningful scale.

The RAG-versus-long-context question needs to be answered per use case, not globally. Long context is genuinely better for some workloads. RAG is still better for others — high-cardinality retrieval over large corpora, freshness-critical applications, situations where you need to cite specific sources. The right answer depends on the distribution of your queries, your latency requirements, and your update frequency. Teams that made this call globally in 2025 made it wrong.
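Encoded as a first-pass heuristic, that per-use-case call looks something like the sketch below. The threshold and the factors are illustrative placeholders; the real decision weighs your query distribution, latency budget, and update cadence:

```python
def retrieval_strategy(corpus_tokens: int, context_window: int,
                       needs_fresh_data: bool,
                       needs_citations: bool) -> str:
    """First-pass heuristic for the RAG-versus-long-context call.
    The 0.5 threshold and the factors here are placeholders, not
    tuned values."""
    # Freshness-critical or citation-heavy workloads still favor RAG:
    # an index you can update, and sources you can point to.
    if needs_fresh_data or needs_citations:
        return "RAG"
    # If the working set fits comfortably in the window, stuff the
    # context and skip the retrieval pipeline.
    if corpus_tokens < 0.5 * context_window:
        return "long context"
    # Large, high-cardinality corpora still need retrieval.
    return "RAG"
```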


The headline of 2025 was not a model name or a benchmark. The headline was a shift in where the intelligence was coming from. It turns out the question was never just "how big is the model" — it was always "how much of what this model knows can we surface at the moment it needs to reason." Inference-time compute and RL-generated reasoning were the year's answer to that question.

We are building in 2026 with a meaningfully different set of assumptions than we had at the start of 2025. That is a good sign. The field moved. The question is whether the instincts that got updated in 2025 actually stick — or whether the old habits come back the next time a large model launches with a big parameter count and a nice press release.

I am betting on the update holding. The evidence is too clear to ignore a second time.