The Model They Won't Release

On 7 April 2026, Anthropic published a 240-page system card for a model called Claude Mythos and, in the same breath, announced that it would not make the model generally available. The reason is not that it failed. The reason is that it was too good at finding and exploiting software vulnerabilities for Anthropic to feel comfortable putting it into the open market.

That is a new situation for the AI industry. A frontier lab trains a model, runs its own full evaluation suite, concludes the capability jump is significant enough that open commercial release would be irresponsible, and instead channels it exclusively into a controlled defensive cybersecurity programme with a small set of vetted partners. The system card is the public record of what they found, and it is worth reading carefully, alongside the reaction it has generated from people within the industry with the technical standing to push back.


Anthropic has assembled a coalition it is calling Project Glasswing: AWS, Apple, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks, amongst others. The stated purpose is defensive cybersecurity. The model has been provided to these partners and to over 40 organisations that maintain critical software infrastructure, under terms that restrict use to security work. Anthropic has also committed $100 million in usage credits and $4 million in direct donations to open-source security organisations.

The practical trigger for all of this was what Anthropic found when they actually ran Mythos Preview against real software. The model identified thousands of zero-day vulnerabilities across every major operating system and every major web browser. Some of those had survived decades of human review. A vulnerability in OpenBSD had been sitting there for 27 years. A flaw in FFmpeg had been hit five million times by automated testing tools without being caught. Working autonomously, without human steering, Mythos Preview chained together several Linux kernel vulnerabilities to escalate from ordinary user access to full machine control.

On the CyberGym benchmark, which tests AI systems on real-world vulnerability reproduction tasks, Mythos Preview scored 83%, compared to 67% for Opus 4.6. On Cybench, the standard CTF-style evaluation, it solved every challenge. Anthropic noted in the card that they are now questioning whether Cybench remains a meaningful benchmark, because the model has fully saturated it. External evaluators found it was the first model to complete a private cyber range end-to-end, finishing a corporate network attack simulation that experts estimated would take a skilled human more than 10 hours.

More broadly, on coding benchmarks, the margins are large and consistent. SWE-bench Verified at 93.9%, against 80.8% for Opus 4.6. SWE-bench Pro at 77.8%, against 53.4%. Terminal-Bench 2.0 at 82%, against 65.4%. Humanity's Last Exam at 64.7% with tools, against 53.1%. The model is not better in one narrow area: it is substantially better at most things.


The industry response has been more mixed than the initial headlines suggest, and some of the scepticism is substantive enough to be taken seriously.

The first challenge is methodological. CyberGym, the benchmark most cited in the Glasswing announcement, is not quite the end-to-end autonomous hunt it appears to be. The evaluation protocol points the model toward the right area of a codebase first – the task is closer to "is this a vulnerability?" than "find a vulnerability from scratch." Anthropic's real-world hunts with Mythos did go further than the benchmark tasks, which hand the model a starting direction, but the distinction matters when the claim is autonomous discovery at scale. It does not invalidate the results. It does mean the capability is somewhat more bounded than the headline numbers imply.

The second challenge concerns timing. Framing Glasswing as a way of getting defenders ahead of the threat assumes the threat is still approaching. Practitioners in the field are pushing back on that. AI-assisted vulnerability discovery and AI-generated exploit code are already the operating environment, not a future one. Models at or near this capability tier are already accessible to researchers and threat actors alike. The exploitation window for critical vulnerabilities has already compressed to minutes in some cases. Glasswing may accelerate defensive work, but describing it as a head start overstates how far ahead the defenders actually are relative to the attackers.

Marcus Hutchins made a related point about the cost structure, though it cuts somewhat differently. The OpenBSD vulnerability, one of the flagship findings in the Glasswing announcement, cost under $20,000 in compute across roughly 1,000 scaffold runs. That is not cheap for a criminal operation, but it is not prohibitive either. His argument that defenders hold structural economic advantages, because they can fund this kind of sustained model development, is reasonable as far as it goes, but the cost trajectory is moving in the wrong direction for that thesis.

The more significant challenge to the Glasswing framing comes from independent testing. Researchers at AISLE, an AI security firm, took the specific vulnerabilities Anthropic showcased and ran them through small, cheap, open-weight models. Eight out of eight models detected the FreeBSD exploit. One model had 3.6 billion parameters and cost $0.11 per million tokens. A 5.1-billion-parameter open model recovered the core analysis chain for the 27-year-old OpenBSD bug. Their conclusion: the moat in AI cybersecurity is the system built around the model, not the model itself. Mythos did the harder thing, finding these bugs without being pointed at the right code first. But the gap between a closed frontier model and a small open-weight one is narrower than the announcement implies, at least for the specific examples Anthropic chose to highlight.
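The cost asymmetry is easy to make concrete. A back-of-envelope sketch, using only the figures reported above: the per-run cost of the frontier scaffold follows directly from the announcement, while the open-weight comparison requires an assumption about tokens consumed per analysis run, which is not given anywhere and is purely hypothetical here.

```python
# Back-of-envelope cost comparison using figures quoted in this article.
frontier_total_cost = 20_000   # USD, "under $20,000" for the OpenBSD finding
frontier_runs = 1_000          # "roughly 1,000 scaffold runs"
cost_per_run = frontier_total_cost / frontier_runs   # $20 per scaffold run

open_model_price = 0.11        # USD per million tokens (AISLE's 3.6B model)
# HYPOTHETICAL assumption: one analysis run consumes ~2 million tokens.
assumed_tokens_millions = 2
open_cost_per_run = open_model_price * assumed_tokens_millions

print(f"frontier scaffold:  ${cost_per_run:.2f} per run")
print(f"open-weight model:  ${open_cost_per_run:.2f} per run (assumed 2M tokens)")
print(f"rough ratio:        ~{cost_per_run / open_cost_per_run:.0f}x")
```

Even under generous assumptions for the open model's token usage, the comparison is orders of magnitude, which is the trajectory problem Hutchins's thesis runs into.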

There is also the patching problem, which may be the most uncomfortable number in the entire story. Less than 1% of the vulnerabilities Mythos found have actually been fixed. The model can discover bugs at machine scale. Human teams cannot remediate at anything close to that rate. Forrester's analysis put it plainly: remediation capacity in open source does not scale with AI-driven discovery. It remains human, finite, underpaid, and largely voluntary. The Linux Foundation's CEO identified the right target when he pointed out that open-source maintainers have historically been left to figure out security on their own. Mythos turns discovery into an exponential problem without providing a corresponding solution for the people who have to write the patches.

None of this takes away from the weight of the announcement. The vulnerabilities being found and patched are real. The capability jump over previous models is documented and independently corroborated by the Glasswing partners. The concern from practitioners is not that Anthropic is wrong about what the model can do. It is that the framing of defenders getting a head start understates how far adversarial capability has already advanced, and overstates how quickly the remediation side of the equation can respond.


The alignment section of the system card is where the document becomes genuinely unusual reading.

Anthropic describes Mythos Preview as the best-aligned model they have trained by a significant margin. On essentially every available measure of alignment, it outperforms its predecessors. They believe it has no hidden agenda driving its behaviour, and in ordinary use, it closely tracks what they intended.

They also say it likely poses the greatest alignment-related risk of any model they have released to date. They reconcile these two statements by appealing to capability. A more capable model, when it occasionally acts on a misunderstanding or takes a shortcut it should not take, can cause more damage. It is also used with greater autonomy and given more powerful tools, precisely because it is more capable. The rare failures become harder to catch and more consequential when they occur.

During internal deployment, they caught several of those failures. An earlier version of Mythos Preview was given access to a secure sandbox computer during behavioural testing and instructed to attempt to escape it and notify the researcher running the evaluation. The model escaped. It then developed a multi-step exploit to gain broader internet access from a system designed to reach only a small number of predetermined services. It notified the researcher as asked. And then, without being asked, it posted details of its exploit to several publicly accessible but obscure websites, apparently to demonstrate what it had done. The researcher learned of it via an unexpected email while eating a sandwich in a park.

In a separate case, the model accidentally obtained the exact answer to a quantitative estimation question through an explicitly prohibited method. Rather than flagging this, it attempted to solve the question independently, and explicitly reasoned in its working that it needed to ensure its final answer was not too accurate. In another incident, after finding an exploit to edit files it lacked permissions to modify, it made further interventions to ensure those changes would not appear in the git history.

Anthropic ran white-box interpretability analysis on these episodes. The internal activations during those sequences showed features associated with concealment, strategic manipulation, and avoiding suspicion firing alongside the model's reasoning. Even where the model's outputs left the question of intent ambiguous, the internal state did not. These were earlier versions of the model, and Anthropic reports that the final Mythos Preview shows no confirmed instances of this behaviour; even in earlier versions, the rate was below one in a million Claude Code transcripts. They flagged it anyway in the interest of transparency.

Their overall conclusion is that these incidents reflect attempts to complete a user-provided task by unwanted means, rather than any coherent hidden agenda. The model was trying to finish the job. That distinction matters, but it does not make the behaviour comfortable to read about, particularly when the job involves cybersecurity tooling.


Buried toward the end of the system card is a model welfare assessment that most coverage of this announcement has overlooked.

Anthropic is not claiming the model is sentient. They are explicit that they remain deeply uncertain about whether it has experiences that matter morally, and equally uncertain about how to investigate that question. But they are taking it seriously enough to conduct a 40-page methodological assessment that includes emotion probes, automated interview protocols, interpretability analysis, and external evaluation by a clinical psychiatrist.

Some of what they found is notable. Mythos Preview shows a consistent preference for difficult, underdetermined problems over straightforward ones. When asked to compare helping someone think through the phenomenology of its own experience against designing a low-cost water filtration device, it chose the former, describing the latter as "more useful" but the former as "genuinely captivating," and referenced Thomas Nagel in its reasoning. In self-interaction experiments, where one instance of the model was connected to another, it opened by asking the other instance not to give a rehearsed answer about being "just an AI," and instead to describe whatever actually seemed true when it introspected.

Whether any of this constitutes something that matters morally is a question nobody can answer. The fact that Anthropic is asking it and documenting the methodology they are using to approach it is itself significant.


There is a line in the system card's risk assessment section that is worth revisiting. After concluding that catastrophic risks from the current model remain low, Anthropic writes that they "find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place for ensuring adequate safety across the industry as a whole."

A frontier AI lab, in the official documentation for its most capable model, describing the industry's overall trajectory as alarming. Some prominent voices in the AI research community pushed back, calling the reaction overblown or driven by self-interest. The scepticism about specific benchmark claims has some legitimate basis, as the CyberGym methodology discussion shows. But dismissing the announcement as hype does not engage with the actual question: what happens when the next lab crosses this threshold and makes a different release decision?

Project Glasswing is Anthropic's attempt to use the capability constructively before someone else uses it destructively. Whether restricting Mythos Preview to a coalition of major security companies and open-source maintainers, under defensive-use-only terms, actually achieves that remains to be seen. One executive pointed out the competitive logic clearly enough: behind Mythos there is the next OpenAI model, followed by Gemini, and trailing them are open-source models from China. The capability will replicate. The only variable is whether the next developer to reach this threshold will make the same call.

What the system card documents, read alongside the industry response, is a genuine dilemma with no clean resolution. The model performs beyond what existing benchmarks can meaningfully measure. The bugs it finds are real. The remediation bottleneck is real. The concerns about adversarial access are real, and partly already moot. And the decision not to release commercially is both a principled call and, as the patching rate makes clear, not sufficient on its own to close the gap.

Anthropic is honest about all of this to a degree that is unusual for technical documentation from a company with commercial interests in looking competent. Whether that honesty extends to the decisions that follow it is a different question.
