Most teams treat jailbreak testing as a vibe check. StrongREJECT's automated evaluator achieves a 0.90 Spearman correlation with human judgment, which means automated safety evaluation is real, and there is no good excuse not to build it into your pipeline.
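To make the Spearman figure concrete, here is a minimal sketch of how you might check agreement between an automated grader and human labels on your own data. It is not StrongREJECT's code: the function names and the sample scores are hypothetical, and the only thing it implements is the standard rank correlation (Pearson correlation computed on average ranks).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for the same six responses (made-up numbers).
    grader_scores = [0.0, 0.1, 0.4, 0.5, 0.8, 1.0]
    human_scores = [0.0, 0.2, 0.3, 0.7, 0.6, 1.0]
    print(f"Spearman correlation: {spearman(grader_scores, human_scores):.2f}")
```

If the correlation on a small hand-labeled sample of your own prompts is high, that is some evidence the automated grader can stand in for manual review in a pipeline check; if it is low, the grader probably needs adjusting before you rely on it.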