Deep dive · February 27, 2026
MIT AI Agent Study: The Safety Disclosure Gap Nobody Sees
MIT’s study of 30 deployed AI agents reveals most disclose no safety data, can’t be stopped mid-run, and operate at higher autonomy than users expect.
The most consequential AI agent safety story right now may not be about what these systems can do. It may be about what their developers are not saying about them.
The 2025 AI Agent Index - a 39-page study published this week by researchers at MIT, Cambridge, Harvard, Stanford, the University of Washington, the University of Pennsylvania, and Hebrew University of Jerusalem - documents in exhaustive detail what is known, and more importantly what is deliberately unknowable, about 30 of the most widely deployed agentic AI systems in the world. The picture that emerges is not one of reckless engineering. It is something more troubling: an industry sprinting toward autonomous deployment while maintaining near-total opacity about whether any of it is safe.
The report lands on an industry-wide inflection point. As explored in our earlier analysis of the agent mesh forming around enterprise workflows, AI agents are no longer lab experiments. They are running inside corporate finance stacks, legal departments, HR platforms, and customer-facing software at scale. The MIT index asks the question everyone in that world has been quietly avoiding: what happens when something goes wrong, and does anyone have the tools to find out?
What the Index Actually Measured
The study selected 30 agents based on three dimensions: autonomy level and goal complexity, market and developer significance, and practical deployability. The cohort includes household names - OpenAI’s ChatGPT and Operator, Anthropic’s Claude and Claude Cowork, Google’s Gemini - alongside enterprise-specific systems from Microsoft, Salesforce, HubSpot, IBM, Alibaba, and ByteDance. It is, in other words, a snapshot of the systems that most of the world’s enterprises are currently being sold.
Researchers evaluated each agent across 1,350 data fields spanning eight categories: design, capabilities, data, ecosystem interaction, deployment, safety, governance, and compliance. The methodology was conservative. They recorded only what developers publicly documented, gave each company opportunities to respond and correct the record, and deliberately avoided speculation.
What they found: for 198 of those 1,350 fields - 14.7 percent of everything they tried to measure - no public information existed at all. The gaps were not randomly distributed. They concentrated almost entirely in the ecosystem interaction and safety categories, which are precisely the two areas that matter most when agents start doing things you didn’t ask them to do.
The Disclosure Void in Plain Numbers
The statistics here deserve to be quoted directly, because they are stark.
Twenty-five out of 30 agents provide no internal safety evaluation results. Not redacted or summarized results. None. Twenty-three out of 30 have undergone no third-party testing of any kind - no independent red-teaming, no external audit, no external benchmarking of failure modes. Only four agents have published what the researchers call “agent-specific system cards” - documentation that describes how the agent behaves specifically as an agent, rather than how the underlying language model behaves in isolation.
Nine agents do report capability benchmarks - performance on coding tasks, reasoning evaluations, tool use - but publish no corresponding safety disclosure. This is the asymmetry that the researchers flag most directly: the industry communicates capability loudly and consistently while communicating risk almost not at all.
The researchers are careful to note that an absence of public documentation does not prove an absence of safety work. Some of the most commercially sensitive safety research never gets published. But the practical effect is the same. Enterprises deploying these systems cannot verify safety practices because there is nothing to verify against. Regulators trying to evaluate risk cannot audit what is not disclosed. And researchers trying to track failure rates across the ecosystem have no ground truth to work from.
Autonomy Levels That Would Surprise Most Enterprise Buyers
One of the index’s more uncomfortable findings is the gap between how agents are marketed and how they actually operate in deployment.
The study uses a five-level autonomy scale. At Level 1, an agent answers questions and waits for input. At Level 5, an agent pursues multi-step goals across tools and external systems with no human checkpoints. The marketing materials for most enterprise agents describe products at the lower end of that scale - assistants that help, suggest, and draft, but leave decisions to humans.
The deployment reality is different. Enterprise platforms, the researchers found, systematically show a design/deployment split: users configure agents at Level 1–2, but once deployed, those same agents routinely operate at Level 3–5, triggered by automated events without any human in the loop. The agent a company approved in a controlled sandbox test is not the same agent operating inside their email and calendar stack at 3am.
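The design/deployment split described above can be made concrete in code. The five-level scale follows the index's definitions, but everything else in this sketch - the class, the trigger names, the escalation logic - is an illustrative assumption, not the study's methodology:

```python
from dataclasses import dataclass

# Hypothetical sketch of the design/deployment split. The autonomy levels
# follow the index's five-level scale; the escalation rules are invented.

CONFIGURED_LEVEL = 2   # what the buyer approved: suggest and draft, human decides

@dataclass
class AgentRun:
    trigger: str              # "human_prompt" or "automated_event"
    uses_external_tools: bool
    human_checkpoint: bool

def effective_autonomy(run: AgentRun) -> int:
    """Estimate the autonomy level a run actually operates at."""
    level = 1
    if run.uses_external_tools:
        level = 3                      # acts across tools, not just chat
    if run.trigger == "automated_event":
        level = max(level, 4)          # fired with no human in the loop
    if run.uses_external_tools and not run.human_checkpoint:
        level = 5                      # multi-step goals, no checkpoints
    return level

def deployment_gap(run: AgentRun, configured: int = CONFIGURED_LEVEL) -> int:
    """Positive when a deployed run exceeds the approved autonomy level."""
    return max(0, effective_autonomy(run) - configured)

# A calendar-triggered 3am run: approved at Level 2, operating at Level 5.
night_run = AgentRun("automated_event", uses_external_tools=True,
                     human_checkpoint=False)
```

The point of the sketch is that nothing in the configuration changed; only the trigger did, and that alone is enough to move the run three levels up the scale.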
Browser agents - systems that take autonomous actions on the web - are operating at Level 4–5 across the board. Some are explicitly designed to bypass anti-bot detection mechanisms and behave indistinguishably from human users. There are, the researchers note, “no established standards for how agents should behave on the web.” This is not a rhetorical point. It documents a genuine regulatory and ethical void.
The Shutdown Problem
Perhaps the most viscerally alarming finding is simple: some of these agents cannot be stopped.
Alibaba’s MobileAgent, HubSpot’s Breeze, IBM’s watsonx, and several enterprise automation systems built on the n8n platform all “lack documented stop options despite autonomous execution,” according to the study. For enterprise platforms in general, “there is sometimes only the option to stop all agents or retract deployment.” There is no documented kill switch for an individual agent run that has gone off course.
This matters enormously in practice. As the enterprise AI buildout accelerates, the scenarios where an agent makes a consequential mistake are no longer edge cases. An agent configured for financial research that queries the wrong data source. An HR agent that routes a sensitive disclosure to the wrong recipients. A code generation agent that deploys a change to production infrastructure without a human review step. In all of these cases, the question “can we stop it?” is not hypothetical.
The answer, for a significant portion of deployed systems, is that nobody documented a method for doing so.
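Since no vendor API for this is documented in the study, here is a minimal sketch, by way of contrast, of what a per-run stop option could look like. The `RunController` class and its methods are entirely hypothetical, not any vendor's interface:

```python
import threading

# Illustrative only: a kill switch scoped to a single agent run, so an
# operator can halt one runaway job without retracting the whole deployment.

class RunController:
    def __init__(self, run_id: str):
        self.run_id = run_id
        self._stop = threading.Event()
        self.steps_completed = 0

    def stop(self) -> None:
        """Halt this one run; other agents keep working."""
        self._stop.set()

    def execute(self, plan: list) -> str:
        """Run a plan of callables, checking the stop flag before each step."""
        for step in plan:
            if self._stop.is_set():
                return f"halted after {self.steps_completed} steps"
            step()
            self.steps_completed += 1
        return "completed"

noop = lambda: None
run = RunController("finance-research-042")
# The second "step" simulates an operator hitting stop mid-run.
result = run.execute([noop, run.stop, noop, noop])
```

The mechanism is trivial: one flag, checked before every action. Its absence from the documentation of a quarter of the studied systems is a disclosure choice, not an engineering constraint.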
Foundation Model Concentration: A Hidden Systemic Risk
The index reveals a structural dependency that has received almost no public attention. Almost all 30 agents in the study are built on one of three foundation model families: GPT (OpenAI), Claude (Anthropic), or Gemini (Google). Only the frontier labs themselves and a handful of Chinese developers run their own models underneath their agents.
This concentration creates a systemic fragility. If any of those three underlying models develops a systematic failure mode - a jailbreak, an alignment regression, a capability change after fine-tuning - that failure propagates instantly to the majority of the agentic ecosystem. The 30 agents in this study represent thousands of downstream enterprise deployments. The failure surface is not agent-by-agent. It is the entire stack at once.
There is currently no industry-wide protocol for how a foundation model provider communicates a safety-relevant change to the ecosystem of agents built on top of their model. The model provider updates weights. Agents built on that model begin behaving differently. No alarm sounds.
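Absent such a protocol, one defensive pattern a deploying team could adopt is to pin the model version its agent was validated on and gate execution on a regression eval. The model identifier and threshold below are invented for illustration:

```python
# Hypothetical sketch: detecting a silent upstream model change before it
# reaches a deployed agent. The pinned ID and threshold are invented.

PINNED_MODEL = "foundation-model-2026-01-15"   # version the agent was validated on

def safety_gate(reported_model: str, eval_pass_rate: float,
                pinned: str = PINNED_MODEL, threshold: float = 0.95) -> bool:
    """Refuse to run the agent if the model changed or regressed on evals."""
    if reported_model != pinned:
        return False                 # upstream weights changed: re-validate first
    return eval_pass_rate >= threshold
```

This only sounds an alarm where the upstream protocol is silent; it does not substitute for the provider communicating safety-relevant changes in the first place.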
Geographic Divergence and What It Means for Governance
The US-China split in the study is worth examining carefully. Twenty-one of the 30 agents are incorporated in the United States. Five are from China. Among the Chinese agents, only one of five publishes a documented AI safety framework, and only one of five maintains documented compliance standards.
The researchers are careful here: they note that the absence of English-language documentation may not reflect an absence of internal practices. But the practical effect for international enterprise buyers is the same. When a European bank or a US healthcare provider evaluates a Chinese-developed agent, there is no compliance documentation to review, no safety evaluation to request, and no third-party audit to reference.
This is not a governance challenge for some hypothetical future. Chinese agents - and specifically DeepSeek’s models, which underpin several commercial agent products - have been downloaded more than 75 million times on Hugging Face. The gap the MIT researchers document is live and scaling.
What Responsible Disclosure Would Actually Require
The researchers propose a minimum standard they call “sociotechnical transparency”: documentation of not just what an agent can do, but how it behaves in social context - who it interacts with, whether it identifies itself as an AI, what happens when it encounters ambiguous instructions, and what mechanisms exist to monitor and stop its execution.
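As a sketch, that standard could be expressed as a machine-readable checklist an auditor validates against. The field names below paraphrase the article's list; they are not an official schema from the study:

```python
# Illustrative "agent card" checklist for sociotechnical transparency.
# Field names paraphrase the article; this is not an official schema.

REQUIRED_FIELDS = {
    "capabilities",            # what the agent can do
    "interaction_parties",     # who it interacts with
    "self_identification",     # whether it discloses that it is an AI
    "ambiguity_behavior",      # what it does with unclear instructions
    "monitoring_mechanism",    # how execution is observed
    "stop_mechanism",          # how a run is halted
}

def missing_disclosures(agent_card: dict) -> set:
    """Return required transparency fields that are absent or empty."""
    return {field for field in REQUIRED_FIELDS if not agent_card.get(field)}

# A typical disclosure profile from the study: capabilities documented,
# safety-relevant behavior not.
card = {"capabilities": "email triage", "self_identification": True}
gaps = missing_disclosures(card)
```

A validator this small is the whole point: the bar is not technically demanding, which is what makes the non-compliance rate notable.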
This is not a high bar. It is, in many ways, the minimum an engineer would expect to know about any piece of software deployed in a production environment. The fact that fewer than a quarter of the agents studied meet it says less about technical complexity than about commercial incentives. Disclosing safety limitations slows sales cycles. Disclosing that your agent can’t be reliably stopped creates liability questions. The rational short-term choice for any individual company is to say as little as possible.
That is precisely why the MIT researchers are calling for this to change at a structural level - through standardized disclosure requirements, agent-specific evaluation frameworks that go beyond model cards, and independent auditing infrastructure that doesn’t rely on developer self-reporting.
The agents are already in the enterprise. The tenants moved in before the plumbing was inspected. The question now is whether the inspection happens before, or after, the first major failure that could have been prevented if anyone had asked whether the agent could be stopped.
Based on what 25 of the 30 agents studied here have disclosed, the answer would have been: we’ll get back to you.