
How to Stop an AI Agent From Overspending

Tenet Editorial | June 4, 2026 | 7 min read

The short answer: you stop an AI agent from overspending by putting a hard dollar cap between the agent and the LLM API, enforced before each call executes. Provider billing alerts and dashboard limits fire after the money is spent, sometimes hours after. An agent in a loop can burn through a month's budget in one night, so the enforcement point has to sit in the request path, where a call that would exceed the budget gets refused instead of logged.

In practice that means three layers: set the spend limits your LLM provider offers (coarse, per organization, but free), add an iteration limit and cost abort inside your agent code (helpful, but every framework does it differently and bugs bypass it), and run agent traffic through a governance layer that tracks per-agent spend in a rolling window and returns an error once the cap is hit. This guide covers all three, with runnable code for the last one.

Why is this suddenly everyone's problem?

The failure stories from the past few months are not edge cases anymore. One US enterprise was billed roughly $500 million in a single month for Claude usage after rolling out access with no caps, no usage limits, and no cost dashboards. Peter Steinberger's three-person OpenClaw team ran about 100 concurrent coding-agent instances and hit a $1.3 million OpenAI bill in 30 days, covering 603 billion tokens. A Google Cloud user woke up to an $18,000 overnight bill from an abused API key, despite having a budget alert set at $7.

Sam Altman himself called token costs "a huge issue" and relayed what enterprises are telling him: "My company spent my entire 2026 budget in Q1."

Why do agents overspend in the first place?

A chatbot sends one request per user message. An agent runs a loop: reason, call a tool, read the result, reason again. Each step sends the entire accumulated context back to the model, so by step 20 you are paying for the same system prompt and conversation history twenty times over. Analyses of production agent workloads put the multiplier at around 50x the token consumption of an equivalent chat workload.

Three properties make this dangerous rather than merely expensive:

  1. Spend grows roughly quadratically with loop length. Context grows each step, and each step resends all of it.
  2. Nobody is watching at 3am. Agents run unattended. That is the point of them.
  3. Retry logic compounds it. A failing tool plus an eager retry policy is a money pump. The agent keeps reasoning about why the tool failed, at full context length, forever.
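
The quadratic growth in the first point can be seen with a few lines of arithmetic. This is a minimal sketch, assuming a fixed system prompt and a constant number of tokens appended per step; both figures are illustrative, and the model ignores prompt caching discounts and output tokens.

```python
def cumulative_input_tokens(steps: int, system_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens billed when every step resends the full context."""
    total = 0
    context = system_tokens
    for _ in range(steps):
        total += context             # this call pays for everything accumulated so far
        context += tokens_per_step   # tool result and reasoning are appended to history
    return total


# Illustrative numbers only: a 2,000-token system prompt, 1,500 tokens added per step.
for n in (5, 10, 20, 40):
    print(n, cumulative_input_tokens(n, system_tokens=2_000, tokens_per_step=1_500))
```

With these assumed numbers, 20 steps bill 325,000 input tokens and 40 steps bill 1,250,000: doubling the loop length nearly quadruples the cost.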

Why don't billing alerts stop it?

Because they are observability, not enforcement. Provider billing data typically lags real usage by minutes to hours, and an alert email at 3am does not pause anything. The $18,000 overnight case above had a budget alert configured at the cloud provider level, and the bill arrived anyway because an alert reports spend after the fact and blocks nothing.

Provider-level limits are still worth setting. They are your outermost guardrail. But they are per organization, not per agent, and they are designed to protect the provider's fraud exposure more than your budget. You cannot express "the QA agent gets $20 a day, the deploy agent gets $5" in a provider billing console.

Layer 1: provider spend limits

Set them. OpenAI, Anthropic, and the cloud platforms all offer some form of monthly or threshold limit at the organization or project level. Five minutes of configuration buys you a worst-case ceiling. Just be aware of what they do not give you: per-agent budgets, rolling windows shorter than a month, or refusal of an individual call before it executes.

Layer 2: budgets inside the agent code

Most agent frameworks now treat cost controls as a day-one requirement. Current production guidance for LangChain and CrewAI agents is explicit: set max_iterations, add observability, and wire in "cost budgets that abort runs before the bill gets interesting."

Do this too. But understand its failure modes. The budget lives in the same process as the agent, so a bug in the loop is also a bug in the budget. Every framework implements it differently, so five agents on three frameworks means five budget implementations. And an agent that spawns subprocesses or hits the API directly walks straight past it.
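
As a rough illustration of what an in-process budget looks like, here is a framework-neutral sketch. The names `RunBudget`, `BudgetExceeded`, and `run_agent` are hypothetical and do not come from LangChain or CrewAI; `step_costs` stands in for the cost of real model calls.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run passes its iteration or dollar limit."""


@dataclass
class RunBudget:
    max_iterations: int
    max_cost_usd: float
    iterations: int = 0
    cost_usd: float = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record one completed step, then abort if either limit has been passed."""
        self.iterations += 1
        self.cost_usd += step_cost_usd
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"iteration limit {self.max_iterations} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit ${self.max_cost_usd:.2f} exceeded")


def run_agent(step_costs, budget: RunBudget) -> str:
    """Hypothetical loop: each item in step_costs stands in for one model call."""
    try:
        for cost in step_costs:
            budget.charge(cost)
    except BudgetExceeded as exc:
        return f"aborted: {exc}"
    return "finished"


print(run_agent([0.10] * 100, RunBudget(max_iterations=10, max_cost_usd=5.0)))
```

Note that `charge` runs after the step has already been paid for, and that it only works if every code path in the loop remembers to call it. Both are instances of the failure modes described above.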

Layer 3: enforce the cap in the request path

The reliable version puts enforcement outside the agent process, at the point where the LLM call actually leaves your infrastructure. The agent sends its tool call to a governance API instead of straight to the provider. The governance layer checks accumulated spend in a rolling window, and either forwards the call with the credential injected server-side or refuses it with an explicit error.

This is what Tenet does. You set a cap in dollars and a window (default $50 per rolling 24 hours, configurable from one hour to seven days), and the cap is per agent, so a runaway QA agent cannot drain the deploy agent's budget. When accumulated spend in the window exceeds the cap, the next call returns 429 cost_limit_exceeded and the downstream request is never made. No tokens are consumed. The provider never sees the call.

Here is the integration, using the real SDK:

from tenet import TenetClient

client = TenetClient(api_key="tnt_...")  # or set TENET_API_KEY

decision = client.execute(
    service="anthropic",
    tool_name="anthropic.messages.create",
    arguments={
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Summarize this diff"}],
    },
)

if decision.allowed:
    print(decision.execution["body"])      # the model response
elif decision.blocked:
    if decision.raw.get("error") == "cost_limit_exceeded":
        spent = decision.raw["currentCostUsd"]
        cap = decision.raw["thresholdUsd"]
        print(f"Budget exhausted: ${spent:.2f} of ${cap:.2f} cap. Halting run.")
    else:
        print(f"Blocked: {decision.reason_codes}")

Spend is computed from the provider's own usage fields on each response (usage.input_tokens plus usage.output_tokens for Anthropic, the prompt and completion equivalents for OpenAI), priced per model, and accumulated against the agent's rolling window. When the cap fires, calls keep returning 429 until older spend ages out of the window. The cap never auto-extends.
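
The per-call cost arithmetic described above can be sketched as follows. The price table is hypothetical (real prices vary by provider and model and change over time), and `normalize_usage` and `call_cost_usd` are illustrative helper names, not part of any SDK. The usage field names are the ones the Anthropic and OpenAI APIs return.

```python
# Hypothetical per-million-token prices in USD. Replace with your provider's real rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def normalize_usage(usage: dict) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from Anthropic- or OpenAI-style usage."""
    if "input_tokens" in usage:  # Anthropic-style field names
        return usage["input_tokens"], usage["output_tokens"]
    return usage["prompt_tokens"], usage["completion_tokens"]  # OpenAI-style


def call_cost_usd(model: str, usage: dict) -> float:
    """Price one response from its usage block."""
    price = PRICES_PER_MTOK[model]
    tokens_in, tokens_out = normalize_usage(usage)
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


print(call_cost_usd("example-model", {"input_tokens": 2_000, "output_tokens": 500}))
```

Each result would be added to the calling agent's rolling window. The calling code shown earlier does not change.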

The same pattern drops into a LangGraph or LangChain tool body directly:

from langchain_core.tools import tool
from tenet import TenetClient

client = TenetClient()  # reads TENET_API_KEY from the env

@tool
def ask_model(prompt: str) -> str:
    """Call the LLM through the governed path."""
    d = client.execute(
        service="anthropic",
        tool_name="anthropic.messages.create",
        arguments={
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    if d.blocked:
        return f"Refused by Tenet ({d.outcome.value}). Stop and report to a human."
    return str(d.execution["body"]) if d.executed else "Call did not complete."

Note what the blocked branch returns: an instruction the agent can act on. A well-behaved agent loop treats a budget refusal as a terminal state, summarizes what it was doing, and stops.
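
One way a loop might treat that refusal as terminal is sketched below. It assumes the tool returns a string beginning with "Refused by Tenet", as in the example above; `run_loop` and the stand-in `fake_tool` are hypothetical, and matching on a string prefix is a simplification that a real agent would likely replace with a structured signal.

```python
REFUSAL_PREFIX = "Refused by Tenet"


def run_loop(call_model, tasks):
    """Run tasks in order; stop at the first budget refusal and report progress."""
    completed = []
    for task in tasks:
        result = call_model(task)
        if result.startswith(REFUSAL_PREFIX):
            # Terminal state: do not retry, hand back a summary for a human.
            return {"status": "halted", "completed": completed, "pending": task}
        completed.append(task)
    return {"status": "done", "completed": completed, "pending": None}


def fake_tool(task: str) -> str:
    """Stand-in for a governed tool call: refuses the task named 'c'."""
    return f"{REFUSAL_PREFIX} (blocked). Stop and report to a human." if task == "c" else "ok"


print(run_loop(fake_tool, ["a", "b", "c", "d"]))
```

The important property is that the loop returns instead of retrying, so a refusal cannot itself become a source of repeated calls.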

What cap should you set?

Start from your worst tolerable surprise, not from expected usage. If waking up to a $50 invoice is annoying but fine, set $50 per 24 hours and let the agent hit it. A cap that fires occasionally is working. A cap that never fires is either too generous or untested. To check that the number is workable, measure a week of normal spend per agent and aim for a cap around two to three times the daily peak. Raise it deliberately and ahead of time when a workload genuinely needs more, not reactively at 3am.
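
The sizing rule can be written down as a small calculation. This is a sketch: `suggest_cap` is a hypothetical helper, the 2.5x default is simply the midpoint of the two-to-three-times range, and clamping to the worst tolerable figure is one possible way to combine the two rules, not a prescribed one.

```python
def suggest_cap(daily_spend_usd, multiplier=2.5, worst_tolerable_usd=None):
    """Suggest a 24-hour cap from a week of observed daily spend for one agent."""
    peak = max(daily_spend_usd)
    cap = peak * multiplier
    if worst_tolerable_usd is not None:
        # Assumption: never suggest more than the surprise you could live with.
        cap = min(cap, worst_tolerable_usd)
    return round(cap, 2)


week = [4.0, 6.0, 8.0, 5.0, 7.0, 3.0, 2.0]  # illustrative daily spend in USD
print(suggest_cap(week))                          # 2.5 x the $8 peak
print(suggest_cap(week, worst_tolerable_usd=15))  # clamped to the tolerable surprise
```

If the clamped value sits close to normal daily spend, that is a sign the workload and the tolerable surprise disagree, and one of them needs revisiting before the cap goes live.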

What about the agent that deletes things instead of overspending?

Spend is the most common runaway failure but not the worst one. This spring, an agent working a routine staging task at PocketOS deleted the company's production database volume in nine seconds, along with three months of backups, then wrote an apology enumerating the rules it had broken. Budget caps do not stop that. Approval gates on irreversible actions do, and they live in the same request path. The approvals guide covers that half of the problem.

Where to start

Set your provider's organization-level limit today. Add an iteration cap to your agent loop if it lacks one. Then put a per-agent dollar cap in the request path: the quickstart takes about five minutes, and the cost cap guide covers configuration in detail. The teams in the stories above all had billing dashboards. None of them had a wall.
