The State of AI Agent Failures: What 4,391 Builder Complaints Reveal

Tenet Editorial | June 8, 2026 | 10 min read

The most common thing AI agent builders complain about is not the model being wrong. It is the agent doing something it should never have been allowed to do. Across 4,391 public discussions where a builder described a real governance pain, the single most frequent theme was permissions and scope (49% of those discussions), followed by uncapped cost (31%) and missing audit or oversight (30%). Hallucination, the failure mode that gets the headlines, is not what builders are filing issues about. They are filing issues about agents with too much reach, no spending limit, and no record of what happened.

This post summarizes what we found when we read the public record at scale. It is a snapshot, not a verdict, and the methodology and its limits are spelled out at the end.

How we built this dataset

We run an internal listening pipeline that continuously collects public, unprompted discussions from AI agent builders: GitHub issues and discussions, Reddit threads, Hacker News comments, developer forums, and YouTube transcripts. Every item is classified by a language model against a fixed taxonomy: is it relevant to agent building, which platform is it about, is there a real governance pain present, how severe is it, and which categories of pain does it touch.

The numbers in this post are drawn from that corpus as of mid-May 2026:

  • 55,589 raw public items collected.
  • 15,867 classified against the taxonomy.
  • 10,251 judged relevant to building agentic systems.
  • 4,391 that contained a concrete governance pain, spanning 77 distinct platforms and frameworks.

The 4,391 governance-pain discussions are the basis for every percentage below. These are not survey responses or interviews. They are things builders said in public, on their own, while trying to ship something.
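To make the funnel easier to read, here is a small sketch that computes the pass-through rate at each stage from the four counts above. The function and variable names are ours, chosen for illustration. On these counts, roughly 43% of the items judged relevant carried a concrete governance pain.

```python
# Funnel counts as reported in this post (corpus as of mid-May 2026).
FUNNEL = [
    ("raw items collected", 55_589),
    ("classified against the taxonomy", 15_867),
    ("relevant to agentic systems", 10_251),
    ("concrete governance pain", 4_391),
]


def stage_rates(funnel):
    """Return (stage, count, share_of_previous_stage, share_of_raw) per stage."""
    raw = funnel[0][1]
    prev = raw
    rows = []
    for name, count in funnel:
        rows.append((name, count, count / prev, count / raw))
        prev = count
    return rows


if __name__ == "__main__":
    for name, count, of_prev, of_raw in stage_rates(FUNNEL):
        print(f"{name:<34} {count:>6}  {of_prev:6.1%} of previous  {of_raw:6.1%} of raw")
```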

The failures builders actually complain about

Here is the breakdown by category. The percentages add up to more than 100% because most discussions touch more than one category at once (more on that below).

Failure category                                            Share of governance-pain discussions
Permissions and scope (agent can reach too much)            49%
Runaway or uncapped cost                                    31%
Missing audit trail or oversight                            30%
Credential and secret handling                              24%
Compliance and policy gaps                                  23%
Approval and human-in-the-loop gaps                         21%
Observability (cannot see what the agent did)               19%
Reliability (loops, unbounded retries, runaway behavior)    15%

Read top to bottom, this is a portrait of a control problem, not an intelligence problem. The categories that dominate are about what the agent is permitted to do, what it costs while doing it, and whether anyone can see or undo it afterward. None of those are fixed by a smarter model.
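For readers who want to see why multi-label shares sum past 100%, here is a minimal sketch. The toy corpus is hypothetical and is not drawn from our data. The published shares are copied from the table above under short keys of our own choosing. One consequence worth noting: the eight shares sum to 212%, so if those eight categories were the complete taxonomy (an assumption, and subject to rounding), the average discussion would carry about 2.1 labels.

```python
from collections import Counter

# Shares from the table above, in percent of the 4,391 governance-pain discussions.
PUBLISHED_SHARES = {
    "permissions": 49, "cost": 31, "audit": 30, "credentials": 24,
    "compliance": 23, "approval": 21, "observability": 19, "reliability": 15,
}


def category_shares(discussions):
    """Share of discussions carrying each label; one discussion may carry several."""
    counts = Counter(label for labels in discussions for label in set(labels))
    return {label: n / len(discussions) for label, n in counts.items()}


# Hypothetical toy corpus: each entry is the label set of one discussion.
toy = [
    {"permissions", "credentials"},
    {"cost", "reliability"},
    {"permissions", "audit", "approval"},
    {"permissions"},
]
shares = category_shares(toy)

# The sum of shares equals the mean number of labels per discussion.
toy_mean_labels = sum(shares.values())
published_mean_labels = sum(PUBLISHED_SHARES.values()) / 100  # assumes 8 categories is all of them
```

In the toy corpus, four discussions carry eight labels in total, so the shares sum to 200% even though no single category exceeds 75%.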

Permissions: the agent can reach too much

Nearly half of all governance-pain discussions touch permissions and scope. The recurring shape: a builder gives an agent a tool or an integration, then discovers the agent can use it in ways they never intended, with no granular way to constrain it. The wishes attached to these discussions are revealing. Builders ask for things like a documented, reliable way to disable a built-in capability without guessing at internal IDs, the ability to gate auto-modifications behind a confirmation, and per-tool scoping that the agent's own logic cannot route around.

The theme underneath all of it: scope that lives inside the agent's reasoning is scope the agent can talk itself out of. What builders keep reaching for is a constraint that sits outside the loop.
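One way to picture a constraint outside the loop is an enforcement point between the agent and its tools. The sketch below is illustrative only: ToolGateway, ScopeViolation, and the two example tools are hypothetical names, not any framework's API. The agent can ask for any tool by name, but the allowlist is fixed when the gateway is built, so nothing the agent reasons its way toward changes what is callable.

```python
from typing import Any, Callable, Dict, FrozenSet


class ScopeViolation(PermissionError):
    """Raised when the agent requests a tool outside its granted scope."""


class ToolGateway:
    """Hypothetical enforcement point that sits between the agent and its tools."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: FrozenSet[str]):
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)  # fixed at construction; no setter is exposed

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        # The check runs here, not in the prompt or the agent's own logic.
        if tool_name not in self._allowed:
            raise ScopeViolation(f"tool {tool_name!r} is not in scope")
        return self._tools[tool_name](**kwargs)


def read_file(path: str) -> str:
    return f"<contents of {path}>"


def delete_file(path: str) -> str:
    return f"deleted {path}"


gateway = ToolGateway(
    tools={"read_file": read_file, "delete_file": delete_file},
    allowed=frozenset({"read_file"}),
)
```

Within a single Python process this is a convention, not a security boundary, since code in the same process can reach private attributes. For the constraint to hold, the gateway would need to run somewhere the agent cannot modify, such as a separate process or service.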

Cost: billing alerts arrive after the money is gone

Cost shows up in 31% of governance-pain discussions, and it is rarely about price-per-token. It is about runaway behavior with no ceiling: a retry that never stops retrying, a loop that re-invokes the model on every pass, a token budget that does not align with anything the builder set. The most common cost wish across the corpus is simple and blunt: an upper bound. A hard cap on retry attempts. A per-tenant usage limit. A concurrency ceiling per provider. Builders are not asking for a dashboard that tells them they overspent. They are asking for something that stops the spend before it happens.
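The difference between an alert and a ceiling is when the check runs. A minimal sketch, assuming a per-call cost estimate is available before the call is made: BudgetGuard, BudgetExceeded, and call_with_cap are hypothetical names, and the dollar figures a caller passes in are placeholders.

```python
class BudgetExceeded(RuntimeError):
    """Raised before a call that would break a configured ceiling."""


class BudgetGuard:
    def __init__(self, max_attempts: int, max_spend_usd: float):
        self.max_attempts = max_attempts
        self.max_spend_usd = max_spend_usd
        self.attempts = 0
        self.spent_usd = 0.0

    def reserve(self, estimated_cost_usd: float) -> None:
        """Check both ceilings before the call is made, then record the spend."""
        if self.attempts + 1 > self.max_attempts:
            raise BudgetExceeded("attempt cap reached")
        if self.spent_usd + estimated_cost_usd > self.max_spend_usd:
            raise BudgetExceeded("spend cap would be exceeded")
        self.attempts += 1
        self.spent_usd += estimated_cost_usd


def call_with_cap(call_model, guard: BudgetGuard, estimated_cost_usd: float):
    """Retry a flaky call, but never past the guard's ceilings."""
    while True:
        guard.reserve(estimated_cost_usd)  # raises instead of spending
        try:
            return call_model()
        except TimeoutError:
            continue  # retry; the guard, not the loop, decides when to stop
```

The loop itself has no exit condition on failure, which is the situation builders describe. What stops it is the guard raising before the next attempt, whichever ceiling is hit first.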

Audit and oversight: nobody can prove what the agent did

Audit and oversight appears in 30% of discussions, and observability in another 19%. The two labels overlap, so their combined reach is lower than the sum: some form of "I cannot see, log, or verify what the agent actually did" touches 43% of all governance-pain discussions, which implies that roughly 6% carry both labels. The wishes here are about propagating errors that currently fail silently, logging when a safeguard does not fire, and getting a trustworthy record out of a system that mostly produces a transcript and a shrug. The pattern is that the agent ran, something happened, and the builder has no defensible account of the sequence.
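One common way to make a record defensible is a hash chain, where each log entry commits to the one before it, so a later edit to any entry breaks verification. The sketch below is a generic illustration of that technique, not a description of any particular product's log format. AuditLog and the event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        entry_hash = _digest(prev_hash, event)
        self.entries.append({"event": event, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash from the start; any edited entry fails the check."""
        prev_hash = GENESIS
        for entry in self.entries:
            expected = _digest(prev_hash, entry["event"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

A log like this can also record the absence of an action, for example an event such as {"safeguard": "approval_gate", "fired": False}, which is one of the things builders ask for. The chain is only tamper-evident if the latest hash is kept somewhere the agent cannot write. Otherwise the whole chain can be rebuilt after an edit.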

Approval gaps: the dangerous action fires before anyone is asked

Approval and human-in-the-loop gaps show up in 21% of discussions. The complaint is almost never "I want to approve everything." It is "the one action I needed to approve fired on its own." Builders ask for the ability to gate destructive or irreversible actions specifically, to surface a confirmation in a subagent's view, and to roll back side effects when something downstream fails. The need is selective, not blanket: a backstop on the actions that actually hurt, not a prompt on every step.
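Selective gating can be sketched in a few lines. The action names in DESTRUCTIVE and the approver callback are hypothetical. In practice the approver would be whatever channel reaches a human, and which actions count as destructive is a decision for the builder.

```python
from typing import Any, Callable, Dict

# Hypothetical names for the actions that must never fire unasked.
DESTRUCTIVE = frozenset({"delete_record", "send_payment"})


class ApprovalDenied(RuntimeError):
    """Raised when a gated action is not approved."""


def run_action(
    name: str,
    action: Callable[..., Any],
    approver: Callable[[str, Dict[str, Any]], bool],
    **kwargs: Any,
) -> Any:
    """Run routine actions directly; ask for approval only on destructive ones."""
    if name in DESTRUCTIVE and not approver(name, kwargs):
        raise ApprovalDenied(f"{name} was not approved")
    return action(**kwargs)
```

Routine actions never reach the approver, which keeps the gate from turning into a prompt on every step. This sketch does not cover rollback of side effects, which builders also ask for and which needs per-action compensation logic.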

Credentials: secrets handled by a system that improvises

Credential and secret handling appears in 24% of discussions. The pains cluster around tokens that do not refresh cleanly, secrets exposed or mishandled in transit, and OAuth flows that drift out of alignment. When an autonomous system holds the keys, every gap in how those keys are stored, rotated, and scoped becomes an agent-shaped attack surface.
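One pattern that narrows this surface is to give the agent short-lived, scoped tokens instead of long-lived secrets. The sketch below is a simplified, in-memory illustration. CredentialBroker and the scope strings are hypothetical, and a real deployment would rely on the identity provider's own token mechanism, not a dictionary.

```python
import secrets
import time


class CredentialBroker:
    """Hypothetical broker that issues short-lived tokens limited to named scopes."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._issued = {}    # token -> (scopes, expires_at)

    def issue(self, scopes) -> str:
        token = secrets.token_urlsafe(32)
        self._issued[token] = (frozenset(scopes), self._clock() + self._ttl)
        return token

    def check(self, token: str, scope: str) -> bool:
        record = self._issued.get(token)
        if record is None:
            return False
        scopes, expires_at = record
        if self._clock() >= expires_at:
            del self._issued[token]  # expired tokens are dropped, never silently extended
            return False
        return scope in scopes
```

With this shape, a leaked token is bounded in both time and reach: it stops working after the TTL and never grants a scope it was not issued for.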

The bigger finding: these failures are compound, not isolated

The most important pattern in the data is not which category ranks first. It is that 73% of governance-pain discussions touch more than one category at once. Cost runs together with reliability, because the loop that will not stop is also the loop that will not stop spending. Permissions run together with credentials, because the agent that can reach too far is holding the keys to do it. Audit runs together with approval, because the action nobody approved is also the action nobody can reconstruct.

This matters for how you fix it. A point solution for any single category leaves the other two or three live. The builders describing these problems are not hitting one wall. They are hitting a cluster, which is why bolting on a cost dashboard or a single approval prompt rarely makes the underlying anxiety go away.
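The compound-failure figure comes from counting label sets, not individual labels. Here is a minimal sketch of that kind of count, using a hypothetical toy corpus, not our data: it returns the share of discussions with more than one label and a tally of which pairs of categories appear together.

```python
from collections import Counter
from itertools import combinations


def compound_stats(discussions):
    """Return (share of discussions with 2+ labels, Counter of co-occurring label pairs)."""
    compound = sum(1 for labels in discussions if len(set(labels)) > 1)
    pairs = Counter()
    for labels in discussions:
        # Sort so each pair is counted under one consistent key.
        pairs.update(combinations(sorted(set(labels)), 2))
    return compound / len(discussions), pairs


# Hypothetical toy corpus: each entry is the label set of one discussion.
toy = [
    {"cost", "reliability"},
    {"permissions", "credentials"},
    {"audit", "approval"},
    {"cost", "reliability", "observability"},
    {"permissions"},
]
compound_share, pair_counts = compound_stats(toy)
```

In the toy corpus, four of five discussions are compound, and cost with reliability is the most frequent pair. The same tally on a real corpus is what shows which categories need to be addressed together.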

How acute is this?

We score each governance pain for severity. Among the most acute discussions, the ones describing a serious, production-relevant failure rather than a minor annoyance, the ordering shifts toward the irreversible end. Permissions still leads (59% of the most severe discussions), but audit (41%) and compliance (39%) rise sharply, ahead of cost (23%). When the stakes go up, builders worry less about the bill and more about whether the agent could do something they cannot see, cannot undo, and cannot account for. Roughly 28% of all governance-pain discussions fell into this most-acute tier.

One more signal worth flagging: in 53% of these discussions the governance pain was not the main topic. The builder brought up the control problem on their own while discussing something else entirely. This is not a topic people only discuss when asked. It surfaces on its own, in the middle of trying to ship.

Where the pain shows up

Governance pain is not concentrated in one corner of the ecosystem. It spans 77 platforms and frameworks in our corpus. The platforms attracting the most governance-pain discussion include OpenCode, MCP servers, LangChain, n8n, Claude Code, Cursor, LangGraph, Mastra, CrewAI, and AutoGen, among many others. The takeaway is not a ranking of which tool is worst. It is that the control problem travels with the agent pattern itself, not with any one framework. Whatever you build on, the same cluster of failures shows up.

What this means if you build agents

The data points to a few practical conclusions, each grounded in what builders said rather than what we wish they had said.

  1. Treat scope as something enforced outside the agent, not inside it. Permissions is the number-one complaint precisely because in-agent scoping can be reasoned around. The constraint that holds is the one the agent cannot reach.
  2. Put a hard ceiling on cost, not an alert. The most common cost wish in the entire corpus is an upper bound. Alerts tell you it already happened.
  3. Gate the irreversible, not the routine. The approval complaint is about the one dangerous action that fired unasked, not about wanting to approve everything. Selective gating on destructive actions is what builders actually request.
  4. Produce a record you can defend. Audit and observability together touch 43% of the discussions. If the agent ran and you cannot reconstruct what it did, that gap is itself a top-tier complaint.
  5. Expect these to come as a bundle. Because 73% of the pains are compound, plan to address the cluster, not a single line item.

Methodology and limits

These findings were synthesized from public, unprompted builder discussions: 55,589 items collected, 15,867 classified, 10,251 relevant, and 4,391 carrying a concrete governance pain across 77 platforms, as of mid-May 2026. Sources span GitHub issues and discussions, Reddit, Hacker News, developer forums, and YouTube transcripts.

Honest caveats:

  • Classification is model-assisted. Each item was labeled by a language model against a fixed taxonomy, not by a panel of human raters. Labels carry a confidence score, and the taxonomy evolved across classifier versions during collection. Treat the percentages as well-supported estimates, not precise measurements.
  • Public complaints are a biased sample. People post when something breaks. A corpus of public discussions over-represents friction and under-represents the agents quietly working fine. This measures what builders complain about, which is the question we set out to answer, but it is not a measure of how often agents fail in absolute terms.
  • Multi-label by design. Most discussions touch several categories, so category shares sum to more than 100%. That is intentional and is the basis for the compound-failure finding.
  • No individuals are quoted or named. The wish examples in this post are paraphrased patterns aggregated across many discussions, not attributed quotes.

We expect to refresh this analysis as the corpus grows. The categories may reorder. The shape, that the loudest agent failures are governance failures rather than intelligence failures, has been stable across every cut we have run so far.


This research comes out of the listening pipeline we run at Tenet, where we build runtime governance for AI agents: cost caps, approval gates, and tamper-evident audit logs that sit outside the agent loop.
