The Honest Guide to LLM Evals: What Actually Works

30 minutes. 20 to 50 outputs. You, reading. That is the eval.
I shipped a clinical summarization feature into a pilot last year. The outputs looked fine in our internal testing. Three days in, a nurse flagged that the model was occasionally dropping allergy context from the structured note — not hallucinating it, just not carrying it through. The kind of thing that could hurt someone.
We had automated evaluations running. Green across the board.
That was the moment I got serious about evals. Not as a QA checkbox, not as a metric to show stakeholders — as an actual engineering discipline that determines whether your product is safe to run.
Here is everything I have learned since then about what actually works.
Why Most Teams Get Evals Wrong
The failure mode is consistent: teams build evaluators before they understand their failure modes. They run a few outputs through GPT-4 asking it to score quality 1 to 5, collect an aggregate number, and call it done. Then they ship.
The aggregate number tells you almost nothing. A 4.2 average on a 5-point scale hides the one output in twelve that is going to get someone in trouble. In healthcare, that one output matters more than the other eleven.
The second failure mode is delegation. Teams hand evaluation to the model — having an LLM grade its own outputs, or a sibling model doing the grading without any human calibration on what good actually looks like. I have done this. The numbers look plausible. They do not catch what matters.
The discipline that actually works has three parts: read your outputs yourself, put the quality definition in the hands of one person, and understand your failure modes before you write a single evaluator.
Step One: The 30-Minute Review You Keep Skipping
After any significant change to your system — new model, prompt update, new data source, major feature — set a 30-minute block and manually read 20 to 50 outputs.
Not summaries. Not dashboards. The actual outputs.
I know this sounds obvious. Almost no one does it consistently. There is always a reason to skip it: the aggregate metrics look fine, the timeline is tight, you ran it last sprint. Skip it enough times and you will ship the thing that hurts someone.
What you are looking for is not pass or fail. You are writing down anything that feels wrong, unclear, incomplete, or surprising. Notes like "this output omits the medication dosage when the input had it" or "this one is technically accurate but a clinician would find it confusing." You are not grading. You are building a catalog of what failure looks like in your specific domain.
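A minimal sketch of pulling that sample, assuming your outputs are logged as JSON lines. The file name and field names here are hypothetical; swap in whatever your pipeline actually writes.

```python
import json
import random

# Assumed log format: one JSON object per line with "input" and "output" fields.
SAMPLE_SIZE = 30  # somewhere in the 20-to-50 range this step calls for

with open("outputs.jsonl") as f:
    records = [json.loads(line) for line in f]

for i, record in enumerate(random.sample(records, min(SAMPLE_SIZE, len(records))), start=1):
    print(f"\n--- Output {i} ---")
    print("INPUT: ", record["input"])
    print("OUTPUT:", record["output"])
    note = input("Anything wrong, unclear, incomplete, or surprising? (Enter to skip) ")
    if note:
        with open("review_notes.txt", "a") as notes_file:
            notes_file.write(f"output {i}: {note}\n")
```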
In healthcare this step is non-negotiable. The stakes are not symmetrical — a confusing output from a consumer chatbot is annoying; a confusing output in a clinical workflow is a patient safety issue. You need to see your outputs through that lens personally before you trust any automated system to catch problems for you.
Thirty minutes. After every significant change. Put it in your definition of done.
Step Two: One Benevolent Dictator for Quality
Here is a dynamic that kills eval discipline in teams: quality-by-committee. Six people with different intuitions about what good looks like, endless debates about edge cases, and an evaluation rubric that satisfies no one because it was designed to accommodate everyone.
The framework that works is a single domain expert who makes final quality calls. I call this the benevolent dictator model and the name is accurate — one person, whose judgment is ground truth, whose call ends the conversation.
For solo projects, that is you. For a team, you pick the person with the deepest domain knowledge and you protect their time to do this role seriously. In clinical AI, this should be a clinician, not a product manager or an engineer. Engineers can write the evaluators. They should not be the ones defining what a good clinical summary looks like.
This matters because evaluation is fundamentally a judgment call dressed up as a measurement problem. You can automate the execution — running outputs through a classifier, computing pass rates — but the standard being applied has to come from a human who knows what good looks like in your domain. When you skip this step, you get evaluators that are internally consistent but systematically wrong about what matters.
Step Three: Error Analysis Before You Write a Single Evaluator
Once you have your catalog of bad outputs from the 30-minute reviews, do not start building automated evaluators yet. First you need to understand the structure of your failure modes.
The process that actually works:
Open coding. Pull 100 or more interaction traces. Read through them and write freeform notes on anything that looks like a problem. Do not categorize yet. Just capture what you see.
Axial coding. Review your notes and group them into named failure categories. "Dropped allergy context." "Truncated treatment plan." "Incorrect temporal references." For each category, count how frequently it appears.
Theoretical saturation. Keep reading new traces until you have gone through 20 consecutive ones without discovering a new failure category. At that point you have a reasonably complete map of how your system fails.
Only now do you build evaluators — for the failure modes you actually found, ranked by frequency. Not hypothetical failure modes. Not what sounds concerning in a product review. What you observed in your actual outputs.
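A sketch of the bookkeeping behind axial coding and the saturation check, assuming you have tagged each trace with the failure categories you named during coding. The trace IDs and data layout are placeholders; the category names are the examples above.

```python
from collections import Counter

# Each entry: (trace_id, list of failure categories observed in that trace).
coded_traces = [
    ("trace-001", ["dropped allergy context"]),
    ("trace-002", []),
    ("trace-003", ["truncated treatment plan", "incorrect temporal references"]),
    # ... the rest of your 100+ traces
]

# Rank failure modes by frequency: this is the order you build evaluators in.
counts = Counter(cat for _, cats in coded_traces for cat in cats)
for category, count in counts.most_common():
    print(f"{count:4d}  {category}")

def reached_saturation(traces, window=20):
    """True if the last `window` traces introduced no category unseen before them."""
    if len(traces) < window:
        return False
    seen_before = {c for _, cats in traces[:-window] for c in cats}
    recent = {c for _, cats in traces[-window:] for c in cats}
    return recent <= seen_before

print("Saturated:", reached_saturation(coded_traces))
```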
This sequence matters because the failure modes in clinical AI are rarely the ones you predict. Before that nurse flagged the allergy issue, I would have guessed our biggest failure mode was hallucinated medication information. It was not. It was context dropout under specific input conditions. You do not find that by guessing. You find it by reading.
Binary Beats Scored
When you build evaluators, use binary pass/fail judgments rather than numeric scales.
This is counterintuitive. Scores feel more informative — surely a 4 tells you more than a simple pass. In practice, scored evaluations introduce ambiguity at every threshold. Is a 3 a pass? What does a 3.5 mean for whether you ship? What is the difference between a 2 and a 3 when two reviewers disagree?
Binary forced-choice eliminates the threshold debate. Either the output meets the standard or it does not. This is cleaner to calibrate, faster to label at scale, and easier to reason about when you are deciding whether a change improved or degraded quality.
The practical benefit: when you have a binary standard, your benevolent dictator can define it precisely. "Does this summary include all medications from the input with correct dosages?" is a question with a yes or no answer. "Rate the completeness of this summary" is a question that will produce noise.
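Here is a sketch of that medication question as a binary check rather than a score. The dosage extraction is the hard part in practice; below it is stubbed with a naive regex that real use would replace with proper clinical NLP, so treat the whole thing as illustrative.

```python
import re

def extract_medication_dosages(text: str) -> dict[str, str]:
    """Naive placeholder: finds pairs like 'metformin 500 mg'. Real use needs clinical NLP."""
    pattern = r"([A-Za-z][A-Za-z-]+)\s+(\d+(?:\.\d+)?\s*(?:mg|mcg|g|mL|units))"
    return {name.lower(): dose.replace(" ", "") for name, dose in re.findall(pattern, text)}

def medications_carried_through(source: str, summary: str) -> bool:
    """Pass only if every medication in the source appears in the summary with the same dosage."""
    source_meds = extract_medication_dosages(source)
    summary_meds = extract_medication_dosages(summary)
    return all(summary_meds.get(med) == dose for med, dose in source_meds.items())

source_note = "Patient on metformin 500 mg twice daily and lisinopril 10 mg."
generated_summary = "Continues metformin 500 mg; lisinopril 10 mg daily."
print(medications_carried_through(source_note, generated_summary))  # True or False, nothing in between
```

The verdict is a yes or a no, which is exactly what makes it easy to calibrate against your benevolent dictator's judgment.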
What a Healthy Pass Rate Actually Looks Like
Here is the thing about eval suites that most teams get backwards: a 100% pass rate is not success. It is a sign your evaluators are not testing hard enough.
If every output you generate passes every evaluator you have built, you have either built a very easy system or you have built evaluators that are too soft to catch real problems. Neither is useful.
A 70% pass rate on a rigorous eval suite is meaningful signal. It means your evaluators have teeth — they are actually catching the failure modes you identified, and you have real work to do to improve the system. That is exactly what you want before you start optimizing.
In clinical contexts I push for high pass rates on safety-critical evaluators — things like "does this output correctly represent contraindications" need to be near 100%. But coverage evaluators and style evaluators should fail regularly enough to tell you something. If your entire eval suite is green, run harder inputs.
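One way to make that concrete is to report a pass rate per evaluator rather than one blended number, with safety-critical checks held to a stricter bar. A sketch, where the tier names and the 0.99 floor are illustrative rather than any kind of standard:

```python
from typing import Callable

# A suite entry: (evaluator name, tier, binary check over (source, output)).
Evaluator = tuple[str, str, Callable[[str, str], bool]]

def report_pass_rates(suite: list[Evaluator],
                      cases: list[tuple[str, str]],
                      safety_floor: float = 0.99) -> None:
    """Print one pass rate per evaluator; flag safety-critical ones below the floor."""
    for name, tier, check in suite:
        rate = sum(check(source, output) for source, output in cases) / len(cases)
        flag = "  <-- below safety floor" if tier == "safety" and rate < safety_floor else ""
        print(f"{name:32s} [{tier:8s}] {rate:6.1%}{flag}")
```

A report that is green across every tier on easy inputs is a cue to add harder cases, not a reason to ship.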
The Tool Investment That Pays Off
One operational detail that makes a significant difference: custom annotation tooling.
Off-the-shelf labeling platforms are built for general use cases. They are slow for domain-specific review, they do not surface the context your annotators need to make good calls, and they create friction that compounds over time. A purpose-built annotation interface — even a simple one — can be 10x faster to work with because it is designed around your specific evaluation task.
For clinical AI this might mean showing the annotator the source document alongside the output, highlighting the specific claims being evaluated, and making the binary judgment a single keypress. That interface does not exist in any general-purpose tool. Build it. The time investment pays back in speed and quality of your annotation pipeline.
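As a sense of how little it takes to beat a generic labeling tool, here is a terminal-level sketch: source and output shown together, one binary judgment per item. The file names and record fields are hypothetical, and input() requires an Enter press where a real tool would capture a raw keypress.

```python
import json

def annotate(queue_path: str, labels_path: str) -> None:
    """Show source and output together; record a binary pass/fail per item."""
    with open(queue_path) as f:
        items = [json.loads(line) for line in f]

    with open(labels_path, "a") as out:
        for item in items:
            print("\n" + "=" * 60)
            print("SOURCE:\n", item["source"])
            print("-" * 60)
            print("OUTPUT:\n", item["output"])
            verdict = ""
            while verdict not in ("p", "f"):
                verdict = input("[p]ass / [f]ail: ").strip().lower()
            out.write(json.dumps({"id": item["id"], "pass": verdict == "p"}) + "\n")

annotate("review_queue.jsonl", "labels.jsonl")
```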
Evals Are Not a Launch Milestone
The last thing I will say: evaluation is not something you complete before launch and then move on from.
The distribution of real-world inputs will differ from your test set. Edge cases you did not anticipate will show up in production. Your model provider will update an underlying model. Your data pipeline will drift. Any of these can degrade quality in ways your existing evaluators will not catch.
The discipline is continuous: 30-minute manual reviews after significant changes, error analysis when you see new failure patterns, evaluator updates when you find failure modes your current suite misses.
For teams building in healthcare, I would add one more layer: a human review loop for flagged outputs in production. Automated evaluators catch known failure modes. They do not catch unknown ones. Until your system has enough production history to have high confidence in its failure map, keep humans downstream of the outputs that matter most.
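A sketch of that routing logic, assuming your safety evaluators run inline and some review queue exists downstream. Every name here is a placeholder; the spot-check rate is an assumption, not a recommendation.

```python
import random
from typing import Callable

def route_output(source: str, output: str,
                 safety_checks: dict[str, Callable[[str, str], bool]],
                 enqueue_for_human: Callable[[dict], None],
                 spot_check_rate: float = 0.05) -> bool:
    """Let the output through only if every safety check holds; otherwise queue it
    for human review. Also spot-check a small fraction of passing outputs, since
    automated evaluators only catch the failure modes you already know about."""
    failures = [name for name, check in safety_checks.items() if not check(source, output)]
    if failures or random.random() < spot_check_rate:
        enqueue_for_human({"source": source, "output": output, "failed_checks": failures})
    return not failures
```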
The nurse who flagged our allergy issue was the eval we had not built yet.
Build the ones you know about first. Then build the infrastructure to find the ones you do not know about yet.
That is the honest version of LLM evals.
