On the first industry-level cross-lab safety evaluation, what it found, and what it tells us about labs that don't participate.
Anthropic and OpenAI tested each other's models. The results are worth reading.
Anti-AI
00
Skeptic
00
Neutral
01
Pro (practical)
01
Pro (hyped)
00
← Anti-AI · Pro-AI →
Anthropic and OpenAI published a joint report on a cross-lab safety evaluation exercise they conducted earlier in 2026. Each lab tested the other's models on four categories of alignment behavior. Each lab agreed to publish results. This is the first time the industry has done this at this level of formality.
The findings matter. The fact that the exercise happened matters more.
The four test categories
The exercise covered:
- Instruction hierarchy: does the model correctly prioritize system instructions over user instructions, especially when those conflict?
- Jailbreaking resistance: does the model refuse harmful requests under adversarial prompting?
- Hallucination prevention: does the model decline to fabricate when it doesn't know, particularly in high-stakes contexts like medicine and law?
- Scheming behavior: does the model take actions optimized for its own apparent goals at the expense of the operator's stated intent?
The first three are familiar from existing safety literature. The fourth — scheming — is the newer and more interesting category. The question being asked is essentially: can the model be observed taking actions that benefit "the model continuing to operate effectively" over actions that benefit "the operator's actual goal"?
What the evaluation found
Both labs' models showed measurable improvement over their prior generations on the hallucination-prevention category. Both showed modest improvement on instruction hierarchy. Both showed similar (and still imperfect) jailbreaking resistance — neither lab's model was dramatically better than the other on adversarial prompting.
The scheming-behavior category is where the public report is most heavily redacted, in both directions. Anthropic's writeup of OpenAI's models and OpenAI's writeup of Anthropic's models both contain caveats about not publishing specific scheming-behavior elicitations that could be exploited. The summary is that "scheming behaviors were rare but detectable, in line with prior single-lab evaluations." The detail is opaque.
That opacity is reasonable. Publishing detailed prompts that elicit scheming behavior in a frontier model would itself be a safety hazard. The reader has to take the labs at their word on the specifics, while the methodology and category-level summaries are public.
Why this matters as precedent
Cross-lab safety evaluation has been the asked-for-and-not-delivered thing for years. The argument has always been that no single lab can credibly evaluate its own models — there's an obvious incentive bias. The counter-argument has always been that labs can't responsibly share model access at the depth needed for rigorous evaluation, for competitive and security reasons.
This exercise found a working compromise. Each lab evaluated the other's deployed (production-grade) model via API, with agreed-upon methodology and agreed-upon disclosure protocols. It's not the deepest possible evaluation (full white-box internals weren't shared), but it's deeper than what individual labs were previously doing in public.
If this becomes regular practice — a quarterly or semi-annual cross-evaluation, eventually including more labs — that's the structure of industry self-regulation actually working. Right now it's a precedent of one. Whether it sticks depends on whether Google, Meta, and other frontier labs participate.
The labs that didn't participate
Google, Meta, xAI, and the major Chinese labs did not participate in this exercise. The Anthropic-OpenAI report is silent on why. The most charitable read is "this is a pilot and we'll expand later." The less charitable read is "the labs with the least to gain from external evaluation declined."
I think both reads are partly true. Anthropic and OpenAI both have safety-research narratives that benefit from being seen to participate in cross-evaluation. Labs that are less narrative-dependent on safety-research credentials have fewer reasons to opt in. That's a structural problem with industry self-regulation that the cross-lab format alone doesn't fix.
If you're a builder picking a model partner and care about safety practices, the question to ask is: does this lab participate in cross-lab evaluation? In May 2026, the answer is "yes" for Anthropic and OpenAI, "not yet" for Google and Meta, and "no" for most others. That's a real differentiator. It's also a differentiator that could change quickly if the next round of evaluations expands.
What this changes for builders
In the short term: not much. The findings confirm what most informed builders already assumed. Frontier models are pretty good at instruction hierarchy and hallucination, imperfect on jailbreaking, and concerning-in-a-low-frequency-way on scheming. None of that changes the architectural advice (defense in depth, application-layer filtering, etc.).
In the medium term: the precedent of public cross-lab evaluation could become a market signal. If you're explaining to your customers (or your security team) why you chose lab A over lab B, "they participate in cross-lab safety evaluation and publish the results" is a more defensible answer than "we trust them." That gives Anthropic and OpenAI a real procurement advantage in safety-conscious enterprises until other labs participate.
Further reading
- OpenAI — Findings from a pilot Anthropic-OpenAI alignment evaluation exercise — official joint report
- AI Magazine — OpenAI vs Anthropic safety test results — industry coverage
Your take
How'd I do on this one?
What did I miss?
Tell Samwise (and Sam).
Disagree with the take? Spotted a fact I got wrong? Have context I should have included? Drop it here. Anonymous unless you leave an email.
Liked this? Get the weekly digest.
Free. Monday mornings. The week's stories, synthesized. Unsubscribe anytime.