After years of watching smart teams mistake sampling for safety, I no longer ask how many AI tests we ran, only which failures we have made impossible by design.
Anthropic researcher Nicholas Carlini published a blog post describing how he set 16 instances of the company’s Claude Opus 4 ...