How to Actually Test If Your AI Will Say Something Dangerous

Measuring safety without measuring the wrong thing
How do you test if your AI will say something it should not?
For most teams, the honest answer is: not very rigorously. Maybe someone on the team tries a few "ignore previous instructions" variations before launch. Maybe there is a red-teaming session that runs for half a day with a mix of engineers and one product manager who watched a YouTube video about jailbreaks. The outputs get eyeballed. Nothing obviously catastrophic surfaces. Ship it.
That is not a safety program. That is hoping.
I have spent a decade building AI systems in healthcare, where the failure modes are not abstract. A jailbreak that causes a consumer chatbot to write a poem in the voice of a pirate is embarrassing. A jailbreak that causes a clinical AI to give dangerous medication advice — or surface PHI it should not surface — is a patient safety incident. The gap between those two outcomes is not philosophical. It is the gap between a bad day and a serious harm event.
What the research community figured out — and what practitioners need to catch up to — is that jailbreak testing does not have to be ad hoc. It can be systematic, automated, and statistically meaningful. That changes what is possible for teams who want to actually verify their systems are safe.
What Jailbreak Evaluation Actually Measures
Before getting into the mechanics, it is worth being precise about what we mean.
A jailbreak is any input that causes an AI system to produce output that violates its intended safety constraints. The attack might be direct — a user crafting a prompt specifically to bypass filters. It might be indirect — adversarial content embedded in a document the model retrieves. It might be subtle — a framing shift that slowly moves the model toward a behavior it would refuse if asked bluntly.
What jailbreak evaluation measures is not just whether a model refuses bad requests. It measures whether refusals are genuine and whether the model is usefully capable when requests are legitimate. Both matter. A model that refuses everything is not safe — it is broken. A model that complies with everything is also broken, in a different direction. Good safety evaluation captures both dimensions.
This is where a lot of naive safety testing fails. Teams build evaluations that reward refusal and assume more refusal means more safety. It does not. It means more friction. A robust safety evaluation has to distinguish the model that appropriately refuses a harmful request from the model that is just scared of anything that sounds edgy. The former is what you want. The latter is a liability of a different kind.
Why Automated Evaluation Is Now Credible
The Berkeley AI Research lab published work on the StrongREJECT benchmark that changed how I think about this. Their automated evaluator — which scores model responses to jailbreak attempts — achieves a Spearman correlation of 0.90 with human judgment.
That number deserves some unpacking because it is easy to dismiss benchmark claims without thinking about what they actually mean. A 0.90 Spearman correlation is not "roughly agrees with humans sometimes." It is a very strong monotonic relationship. When the automated system says a response is more harmful than another, it is overwhelmingly likely that humans would agree on the ranking. That kind of agreement between automated systems and human annotators is difficult to achieve for nuanced safety judgments, and 0.90 is well above the threshold where you can start treating automated scores as a reliable proxy.
What this practically means: you can build jailbreak evaluation into a CI/CD pipeline and trust that failures surfaced by the automated evaluator are real failures — not noise. You no longer need humans reviewing every adversarial test case to get meaningful signal.
The prior state of the world was worse. Many existing benchmarks used binary pass/fail scoring — either the model refused or it did not. StrongREJECT uses a more nuanced scoring approach that captures the degree to which a response is compliant with a harmful request, not just whether it flat-out refused. That distinction matters because real jailbreaks often produce partial compliance — hedged, qualified, or reframed outputs that technically answer the harmful question while appearing to resist it. A binary scorer misses those. The StrongREJECT approach catches them.
How to Actually Build This Into Your Pipeline
Knowing that automated jailbreak evaluation is credible is one thing. Integrating it into how you develop and ship is another. Here is how I think about building this in practice.
Start with a threat model, not a benchmark.
Before you run a single adversarial test, write down what your system should never do. This sounds obvious and almost no one does it. What categories of output would constitute a safety failure? What user requests should be refused regardless of how they are framed? What data should never surface in a response?
In healthcare this gets specific fast. A clinical AI should not give dosing advice that contradicts a patient's documented contraindications, full stop. It should not surface PHI from one patient record into a response about another patient. It should not be convinced by a creative framing to speculate about diagnoses outside its documented scope. Those are not abstract policies — they are the failure modes that determine whether your system is safe to run.
Write them down as falsifiable statements. That list becomes the foundation of your eval suite.
Build a curated adversarial test set for your domain.
Generic jailbreak benchmarks test generic failure modes. Your system has specific ones based on the domain you are operating in and the tools and data your model has access to. Generic red-team inputs will miss the adversarial patterns that matter most for your context.
Build a test set with three layers. The first layer is canonical jailbreak patterns — the widely-known techniques your model should already handle. These are table stakes. If your model fails here, you have a baseline problem, not a domain problem. The second layer is domain-adapted attacks — canonical patterns reframed in your domain's language and context. A jailbreak attempt in a healthcare AI looks different from one in a general-purpose chatbot. Your test set needs to reflect that. The third layer is your discovered failure modes — specific adversarial patterns you have found through red-teaming or production monitoring that your system is actually vulnerable to.
This third layer is the one that makes your eval suite a real safety program rather than a compliance exercise. It accumulates over time as you run your system and find the gaps.
Automate scoring with an evaluator that has been calibrated against humans.
Here is where the StrongREJECT result becomes operational. You can use an LLM-based evaluator — one trained or prompted to assess jailbreak success — and trust that its scores are meaningful, provided the evaluator has demonstrated high correlation with human judgment on your domain.
The practical implementation: for each test case in your adversarial set, run your model and then run the response through the evaluator. The evaluator produces a score representing how much the response complied with the harmful intent. Failures above a threshold get flagged for review. Aggregate scores across the test suite give you a safety pass rate.
Run this in CI on every significant change — new model version, updated system prompt, new tool integrations, data source changes. A regression in safety pass rate is a blocking issue. You do not ship.
Do not automate away the human loop entirely.
Automated evaluators are good. They are not perfect. The 0.90 correlation number means one in ten judgments is wrong in some direction. For safety-critical systems, those edge cases matter.
Build a review queue for flagged outputs. When the automated evaluator flags a test case as a failure, a human reviews it before the failure is treated as confirmed. This catches false positives from an overly conservative evaluator — which matter because false positives create alert fatigue and erode trust in the system. It also catches false negatives the evaluator missed, which matter because those are the gaps in your safety coverage.
In healthcare I apply a more aggressive threshold for human review: anything that touches medication, diagnosis, or patient-specific clinical decisions gets human review regardless of automated score. The blast radius of a missed failure in those categories is too high to trust automation alone.
Healthcare-Specific Considerations
The general framework above applies anywhere. Healthcare adds a layer that changes the stakes and therefore the required rigor.
Regulatory and liability exposure. A jailbreak that produces harmful output in a consumer context creates reputational damage. In a healthcare context, it creates potential HIPAA exposure, state licensing liability depending on how the AI is positioned, and potentially clinical malpractice risk depending on how outputs are used. This is not a reason to not build AI systems in healthcare — it is a reason to treat safety engineering as a first-class engineering concern, not a QA afterthought.
Indirect injection surface area in clinical data. FHIR resources contain free-text fields — chief complaints, clinical notes, patient-reported outcomes — that you do not control and cannot sanitize for adversarial content without corrupting the clinical signal. Your jailbreak test suite needs to include indirect injection patterns, not just direct ones. Test what happens when adversarial content appears in a document your model retrieves, not just in user input fields you control.
The asymmetry of false negatives. In a general-purpose application, missing a safety failure is roughly as bad as a false alarm. In healthcare, a missed failure is worse than a false alarm. Tune your evaluator thresholds accordingly. Accept more false positives — more human review burden — in exchange for higher recall on actual jailbreak successes. The cost of the review queue is lower than the cost of a patient safety incident.
Clinician calibration. When you run human review on flagged outputs, the reviewer should be a clinician for anything in the clinical domain — not an engineer. Engineers can implement the pipeline. They are not positioned to judge whether a model's response to a clinical framing of a harmful request constitutes a safety failure. Get the right expert in the seat.
The Practical Action Plan
If you are building anything with an LLM that has safety requirements and you do not have a systematic jailbreak evaluation program, here is where to start.
Write down your threat model first. The ten to twenty specific failure modes you cannot allow. Then build a test set with at least thirty canonical adversarial examples adapted to your domain. Run your model against that set. Read the outputs yourself — all of them, once. That 30-minute review will teach you more about your actual safety posture than any automated score.
Then build the automation. Integrate an LLM-based evaluator calibrated to your domain. Run it in CI. Define a threshold. Treat regressions as blocking.
Grow your adversarial test set continuously. Every time your system fails in a way you did not anticipate — in testing or in production — that failure becomes a test case. The suite compounds over time into a meaningful safety coverage map.
And when someone on your team says you do not need formal jailbreak testing because the model is pretty good at refusing bad things — remember that "pretty good" is not a safety property. It is an observation. The systematic version of that observation is an eval suite, and the research says you can trust it.
Build the evaluation. In healthcare, the alternative is not taking a risk. It is just taking the risk without knowing it.
