Three Hard Truths About LLMs in Production Nobody Warned Me About

[Figure: sketch of a production LLM system with stochastic failures and a deprecation cliff]

Three things nobody warns you about before you ship

Eighteen months into running LLM infrastructure in a clinical setting, I got a bug report that stopped me cold. A physician assistant was using our AI-assisted note generation tool and noticed that the same patient note, run twice in a row, produced structurally different output. Not different words — different clinical emphasis. One run foregrounded the chief complaint. The other buried it and led with medication history. Both were defensible summaries. Neither was reliably the same as the other.

I had been building ML systems for twelve years at that point. I knew models were probabilistic. I thought I had accounted for that.

I had not. Not really. Not in the way that mattered for production.

The thing is, nothing about this should have surprised me. The underlying math was not new. But there is a gap between understanding something in the abstract and reckoning with what it means when it is your system, your users, and your domain. That gap is where most LLM production failures live.

Here are the three truths I had to learn the hard way.

Truth 1: Stochasticity Is a Systems Design Problem, Not a Prompting Problem

The first reflex when you see inconsistent outputs is to fix the prompt. Make it more specific. Add a format constraint. Include a few-shot example that anchors the structure. This helps. It is not enough.

Stochasticity in LLMs is not a defect in the prompt — it is a property of the system. Temperature, the sampling strategy (top-p, top-k), the underlying model weights, even the serving infrastructure can all introduce variance. You can reduce it with careful prompting and temperature tuning. You cannot eliminate it by prompting alone. And in any domain where consistency matters — clinical documentation, legal summarization, financial reporting — "reduced but not eliminated" is not an acceptable risk posture.

The realization that actually changed how I built things: stochasticity belongs in the same category as latency and availability. It is an infrastructure characteristic you design around, not a prompt bug you fix.

What that looks like in practice:

Constrain the output space aggressively. Free-form prose has enormous variance. JSON with a defined schema has much less. Enum fields have almost none. Anywhere you can replace an open-ended generation with a structured output, you reduce the variance surface. In our clinical system, the shift from narrative summaries to structured SOAP-format outputs with defined field types cut output variance by more than half — measurably, because we were logging and diffing outputs.
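To make that concrete, here is a minimal sketch of the idea in Python using Pydantic. The field names and the acuity enum are illustrative, not our actual schema:

```python
# Minimal sketch of constraining the output space with a typed schema.
# Field names and the AcuityLevel enum are illustrative stand-ins.
from enum import Enum
from pydantic import BaseModel, Field, ValidationError


class AcuityLevel(str, Enum):
    # Enum fields have almost no variance surface: the model either
    # produces one of these values or the output is rejected.
    ROUTINE = "routine"
    URGENT = "urgent"
    EMERGENT = "emergent"


class SoapNote(BaseModel):
    # One field per SOAP section instead of a single free-form narrative.
    subjective: str = Field(..., min_length=1)
    objective: str = Field(..., min_length=1)
    assessment: str = Field(..., min_length=1)
    plan: str = Field(..., min_length=1)
    acuity: AcuityLevel


def parse_note(raw_model_output: str) -> SoapNote | None:
    """Validate raw model output against the schema; None means reject or retry."""
    try:
        return SoapNote.model_validate_json(raw_model_output)
    except ValidationError:
        return None
```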

Build determinism checkpoints. Not everything in an LLM pipeline needs to be generated. Extracted entities — medication names, dosages, dates, ICD codes — can be verified against structured sources after generation. If the extracted value does not match the source record, you flag it or re-run, rather than trusting the model got it right on the first pass.
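A sketch of what such a checkpoint can look like. The record layout and field names are stand-ins for whatever structured source you verify against:

```python
# Determinism checkpoint sketch: extracted values are checked against the
# structured source record instead of being trusted as generated.
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedMedication:
    name: str
    dosage: str  # e.g. "10 mg"


def verify_medications(
    extracted: list[ExtractedMedication],
    source_medications: dict[str, str],  # name -> dosage from the source record
) -> list[str]:
    """Return a list of discrepancies; an empty list means the checkpoint passed."""
    source_by_name = {k.strip().lower(): v for k, v in source_medications.items()}
    problems = []
    for med in extracted:
        key = med.name.strip().lower()
        if key not in source_by_name:
            problems.append(f"medication not in source record: {med.name}")
            continue
        source_dose = source_by_name[key]
        if med.dosage.strip().lower() != source_dose.strip().lower():
            problems.append(
                f"dosage mismatch for {med.name}: "
                f"extracted {med.dosage!r}, source {source_dose!r}"
            )
    return problems
```

A non-empty result routes the output to a flag-or-regenerate path rather than downstream.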

Treat output consistency as a metric. Run the same prompt against the same input a defined number of times during evaluation and measure agreement. If your eval suite only checks correctness on a single pass, you are missing the variance dimension entirely. We added a consistency score to every model evaluation we ran. It surfaced problems that single-pass evals never would have caught.
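A minimal version of that metric, assuming structured (dict-shaped) outputs. The generate callable is a stand-in for whatever model call your pipeline makes with a fixed prompt and input:

```python
# Consistency metric sketch: run the same (prompt, input) pair N times and
# score pairwise field-level agreement between the structured outputs.
from itertools import combinations
from typing import Callable


def consistency_score(
    generate: Callable[[], dict],  # zero-arg closure over a fixed prompt + input
    runs: int = 5,
) -> float:
    """Mean pairwise agreement across repeated runs (1.0 = identical every time)."""
    outputs = [generate() for _ in range(runs)]
    pair_scores = []
    for a, b in combinations(outputs, 2):
        keys = set(a) | set(b)
        if not keys:
            pair_scores.append(1.0)
            continue
        agree = sum(1 for k in keys if a.get(k) == b.get(k))
        pair_scores.append(agree / len(keys))
    return sum(pair_scores) / len(pair_scores) if pair_scores else 1.0
```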

The hardest part of this truth is that it requires you to stop expecting the model to behave like a function. Functions are deterministic: same input, same output. LLMs are not. The sooner your architecture stops assuming they are, the more resilient your system becomes.

Truth 2: The Model You Built For Will Be Deprecated

Sometime in Q3 of last year, our primary model provider quietly updated the weights behind an endpoint we had been calling for five months. No breaking change in the API. No version flag in the response. The outputs just started being subtly different. Slightly more verbose. Different default behavior on ambiguous instructions. Our evals — which had been green for months — started showing drift.

We caught it within a few days because we had continuous eval infrastructure running. Teams without that infrastructure caught it when users complained, or did not catch it at all.

Model deprecation and silent model updates are not edge cases. They are the normal operating condition of building on externally hosted LLMs. OpenAI, Anthropic, Google — they all update models, deprecate endpoints, and make changes that affect your outputs on timelines you do not control. This is not a criticism. It is the nature of a rapidly evolving field. But it means your system has a dependency that can change underneath you without warning.

The thing nobody tells you upfront: your prompts are not portable. The prompt you carefully engineered for GPT-4-turbo in January will not perform identically on GPT-4o in May. Different training data, different RLHF, different behavioral defaults. The instruction-following behavior your prompt relied on may have changed. The verbosity dial is calibrated differently. The way it handles ambiguous instructions may have shifted.

I learned this by having to re-tune a substantial portion of our prompt library every time we migrated model versions. That work is unavoidable — but you can make it systematic rather than panicked.

Version everything. Your prompts should be versioned and pinned to the model version they were tested against. When you update the model, you explicitly re-evaluate the prompt at the new version. Do not assume forward compatibility.
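One lightweight way to encode that pinning. The registry layout here is illustrative, not a prescription:

```python
# Sketch of a prompt registry entry pinned to the model version it was
# evaluated against. Field names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str        # e.g. "note_summarization"
    prompt_version: str   # e.g. "v14"
    template: str
    pinned_model: str     # exact model identifier this prompt was evaluated on
    eval_run_id: str      # links to the eval results that approved this pairing


def check_pin(active: PromptVersion, deployed_model: str) -> None:
    # Fail loudly if the serving model drifts from the version the prompt was
    # tested against, instead of assuming forward compatibility.
    if deployed_model != active.pinned_model:
        raise RuntimeError(
            f"{active.prompt_id}@{active.prompt_version} was evaluated on "
            f"{active.pinned_model}, but the deployment targets {deployed_model}; "
            "re-run evals before promoting."
        )
```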

Run shadow evals against candidate models continuously. Before a model update forces your hand, run your existing eval suite against the candidate replacement on a sampling of real production inputs. Quantify the delta. Know what changed before it changes in production.
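A sketch of that comparison, assuming you already have an eval harness that can score a case against a given model:

```python
# Shadow eval sketch: score the candidate model on a sample of real
# production inputs and quantify the delta against the current model.
# `score_case` is a stand-in for whatever eval harness you already run.
from typing import Callable, Sequence


def shadow_eval(
    cases: Sequence[dict],
    score_case: Callable[[dict, str], float],  # (case, model_id) -> score in [0, 1]
    current_model: str,
    candidate_model: str,
) -> dict:
    current = [score_case(c, current_model) for c in cases]
    candidate = [score_case(c, candidate_model) for c in cases]
    regressions = [
        i for i, (cur, cand) in enumerate(zip(current, candidate)) if cand < cur
    ]
    return {
        "current_mean": sum(current) / len(current),
        "candidate_mean": sum(candidate) / len(candidate),
        "regressed_case_indices": regressions,  # cases to inspect before migrating
    }
```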

Build your system to be model-agnostic where possible. The parts of your architecture that are model-specific — temperature settings, output parsing logic, system prompt structure — should be isolated and swappable. The parts that are model-agnostic — your business logic, your data pipelines, your evaluation criteria — should not be tangled with them. This is not theoretical clean architecture. This is the thing that determines whether a model migration takes three days or three weeks.
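In code, that isolation can be as simple as a narrow interface that business logic depends on, with every provider-specific detail pushed into adapters. The names here are illustrative, and the adapter body is a placeholder rather than a real SDK call:

```python
# Sketch of isolating model-specific details behind a narrow interface so a
# migration touches adapters, not business logic.
from typing import Protocol


class NoteGenerator(Protocol):
    def generate_note(self, source_document: str) -> str: ...


class ProviderAAdapter:
    """Everything model-specific lives here: prompt structure, temperature,
    output-parsing quirks for one provider and model version."""

    def __init__(self, client, model_id: str, temperature: float = 0.0):
        self._client = client
        self._model_id = model_id
        self._temperature = temperature

    def generate_note(self, source_document: str) -> str:
        # Wire up the provider SDK call and response parsing here.
        raise NotImplementedError


def summarize_encounter(generator: NoteGenerator, source_document: str) -> str:
    # Business logic depends only on the interface, never on a provider SDK.
    return generator.generate_note(source_document)
```

Swapping providers then means writing a new adapter and re-running evals, not rewriting the pipeline.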

The clinician-facing applications we build have users who trust the system enough to integrate it into their workflows. That trust is calibrated on a specific quality of output. When a model update silently changes that quality — in either direction — we owe it to those users to catch it before they do.

Truth 3: Silent Failures Are Worse Than Loud Failures

When an LLM call raises an exception or returns something that clearly cannot be parsed, you know immediately. Your monitoring lights up, someone gets paged, the failure is visible and addressable. Loud failures are painful, but they are honest.

Silent failures are different. A silent failure is an output that looks correct, passes format validation, does not trigger any alert — and is wrong. In healthcare, wrong can mean medication information that is technically present in the output but in a context that inverts its clinical significance. Wrong can mean a risk flag that was in the source document not appearing in the generated summary because the model de-emphasized it under specific phrasing conditions. Wrong can mean a normal-looking note that an experienced clinician would recognize as clinically incoherent, but that no automated validator flags.

The danger of LLMs in high-stakes domains is not that they obviously fail. It is that they confidently produce plausible-sounding outputs that are subtly wrong, and those outputs travel downstream unquestioned.

I thought about this a lot after we found an issue where our summarization system was occasionally dropping relevant history under specific input conditions. The outputs still looked like summaries. The structure was intact. The language was clinical and appropriate. The missing information was not hallucinated — it was simply absent. Our automated evals, which scored on structure and presence of key terms, did not catch it. A nurse caught it.

Designing for silent failure detection is one of the hardest parts of LLM engineering because the failures are domain-specific by nature. There is no generic detector for "clinically significant omission." You have to know your domain well enough to know what should always be present, what should never appear, and what patterns indicate that something important was silently dropped.

Assert on business invariants, not just output format. Format validation tells you the output is shaped correctly. Business invariants tell you it contains what it must. For clinical notes, that means checking: if the source document mentions a known allergy, does the output? If the source includes an active medication, is it represented? These checks are domain-specific and must be written by people who understand the domain — but they are the layer of validation that actually catches silent failures.
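A sketch of that layer. The matching here is deliberately naive plain substring checking, and the allergy and medication sets are assumed to come from the structured source record:

```python
# Business-invariant checks layered on top of format validation.
# Real checks would normalize terminology (brand vs. generic names, synonyms);
# this sketch only shows where the layer sits.
def check_clinical_invariants(
    source_allergies: set[str],
    source_active_medications: set[str],
    generated_note: str,
) -> list[str]:
    """Return violated invariants; an empty list means the note passed this layer."""
    note = generated_note.lower()
    violations = []
    for allergy in source_allergies:
        if allergy.lower() not in note:
            violations.append(f"known allergy missing from note: {allergy}")
    for med in source_active_medications:
        if med.lower() not in note:
            violations.append(f"active medication not represented: {med}")
    return violations
```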

Build explicit absence detection. The hardest failure to catch is something that should be there but is not. Most validation logic checks what is present. You need validation logic that checks for expected presence — fields that should be non-null given the input, categories that should be represented given the task, risk flags that should surface given the clinical content. This requires knowing your domain well enough to define those expectations programmatically.
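One way to encode those expectations programmatically. The rules below are illustrative, not clinical guidance; the point is the shape, a predicate on the input paired with a check on the output:

```python
# Explicit absence detection sketch: expectations about what must be present
# given the input, checked programmatically. Field names are illustrative.
from typing import Callable

# Each rule: (why it applies, does it apply to this input?, is it satisfied in the output?)
AbsenceRule = tuple[str, Callable[[dict], bool], Callable[[dict], bool]]

RULES: list[AbsenceRule] = [
    (
        "source documents an abnormal vital sign, so the output must carry a risk flag",
        lambda src: bool(src.get("abnormal_vitals")),
        lambda out: bool(out.get("risk_flags")),
    ),
    (
        "source records a recent procedure, so the summary must represent it",
        lambda src: bool(src.get("recent_procedures")),
        lambda out: bool(out.get("procedures_mentioned")),
    ),
]


def find_missing(source: dict, output: dict) -> list[str]:
    """Return descriptions of expected content that is silently absent."""
    return [
        why for why, applies, satisfied in RULES
        if applies(source) and not satisfied(output)
    ]
```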

Create feedback loops from domain experts. Automated detection will always lag behind what an experienced domain expert can spot in seconds. The clinical AI systems that maintain quality over time are the ones with structured channels for users to flag suspicious outputs — not just thumbs up/down, but categorized feedback that goes into your error analysis pipeline and eventually surfaces as new eval cases. The nurse who flagged our issue was not a bug reporter. She was a domain expert who knew what a correct output should contain. That knowledge has to feed back into your system.
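A sketch of what categorized feedback can look like as a data structure that feeds the eval suite. The categories are illustrative, not a clinical taxonomy:

```python
# Structured expert feedback sketch: categorized rather than a bare
# thumbs-down, so it can flow into error analysis and become an eval case.
from dataclasses import dataclass
from enum import Enum


class FeedbackCategory(str, Enum):
    OMISSION = "clinically_significant_omission"
    MISATTRIBUTION = "information_in_wrong_context"
    INCOHERENCE = "clinically_incoherent"
    OTHER = "other"


@dataclass(frozen=True)
class ExpertFeedback:
    output_id: str            # which generated output is being flagged
    category: FeedbackCategory
    note: str                 # what the expert expected to see instead
    reporter_role: str        # e.g. "nurse", "physician_assistant"


def to_eval_case(feedback: ExpertFeedback, source_document: str) -> dict:
    # Turn a flagged output into a regression case for the eval suite.
    return {
        "input": source_document,
        "failure_category": feedback.category.value,
        "expectation": feedback.note,
    }
```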


None of this is what the papers and blog posts about LLMs in production emphasize. The conversation is usually about prompting, RAG architectures, fine-tuning, benchmark performance. Those things matter. But they are not what will determine whether your system holds up at scale, survives model deprecations, and catches the failures that matter most.

What determines that is whether you treat stochasticity as an infrastructure problem, build systems that are resilient to the model changing underneath you, and invest in the kind of domain-specific validation that catches the failures that look like successes.

The model is not the system. The system is everything you build around the model to make it safe to run.

Build that part seriously.