LLM Token Budget Strategies for Agents: Stop Runaway Costs Before They Start

April 16, 2026 · 10 min read · engineering

The short version

LLM token budget strategies for AI agents fall into five layers: per-request ceilings, per-session rolling budgets, per-key monthly caps, model-tier routing, and circuit breakers. Apply all five at the gateway layer so budget enforcement needs only a two-line client change (base_url + API key), remains agent-framework-agnostic, and is impossible for the agent to bypass. Without these controls, a single autonomous agent loop can burn $200-$2,000 overnight, and the typical failure mode is discovering the bill 30 days later.

Why agents break your LLM budget

Traditional LLM usage is human-paced: a developer sends a prompt, reads the result, maybe sends a follow-up. The cost curve is linear and predictable. Agent-driven usage is fundamentally different:

  • Loops. An agent calls the LLM, parses the response, takes an action, calls the LLM again. A single user request can trigger 10-100 LLM calls. If the agent gets stuck in a retry loop, that number becomes unbounded.
  • Context window stuffing. Agents accumulate tool results into their context. By iteration 15, the context window is 80K tokens of accumulated state, and every subsequent call sends that entire payload again. Token usage grows quadratically, not linearly.
  • Model escalation. Some frameworks auto-escalate to larger models when the smaller model fails — “if GPT-4o-mini can't do it, try Claude Opus.” That's a 30x per-token cost jump triggered by a parsing error.
  • No human in the loop. A batch of 50 agent tasks kicks off at 2 AM. Nobody is watching. By 6 AM, the damage is done.
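The quadratic growth from context stuffing is easy to verify: if each iteration appends a roughly fixed chunk of tool output and resends the full history, per-call input grows linearly, so cumulative input grows quadratically. A minimal sketch (the 2K-tokens-per-iteration figure is an illustrative assumption, not a measurement):

```python
def total_input_tokens(iterations, base=1_000, growth_per_step=2_000):
    """Cumulative input tokens when every call resends the full
    accumulated context: base prompt plus all prior tool results."""
    return sum(base + i * growth_per_step for i in range(iterations))

# Per-call context grows linearly, but the cumulative total is quadratic:
# 10 iterations resend ~100K tokens in total; 40 iterations resend ~1.6M.
```

Doubling the iteration count roughly quadruples total input spend, which is why long-running loops get expensive so much faster than intuition suggests.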

The core problem: LLM providers charge per token with no built-in spending cap. OpenAI has usage limits but they're account-wide and blunt. You need fine-grained budget controls that match the way agents actually consume tokens — per request, per session, per key, per model tier.

Five budget strategies that actually work

Strategy 1: Per-request token ceiling

Set a hard max_tokens ceiling on every outbound request. This limits the response side; combine it with a gateway-enforced input ceiling to limit the request side too. A typical ceiling: 4,096 output tokens and 32,000 input tokens per request.

Cost cap per request: At GPT-4o rates ($2.50/$10 per MTok), a single request can't exceed ~$0.12. Without the ceiling, an agent sending 128K context + 16K output hits ~$0.48 per request.
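The $0.12 figure follows directly from the ceilings and GPT-4o's quoted rates. A quick sanity check of the arithmetic (the function name is ours):

```python
def max_request_cost(input_ceiling, output_ceiling,
                     input_per_mtok=2.50, output_per_mtok=10.00):
    """Worst-case dollar cost of a single request under the token ceilings."""
    return (input_ceiling * input_per_mtok
            + output_ceiling * output_per_mtok) / 1_000_000

# 32K input + 4,096 output at GPT-4o rates:
print(round(max_request_cost(32_000, 4_096), 2))    # → 0.12
# Unbounded: 128K context + 16K output:
print(round(max_request_cost(128_000, 16_000), 2))  # → 0.48
```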

Strategy 2: Per-session rolling budget

Track cumulative token spend per agent session (identified by a session ID or trace ID). Once the session hits its budget — say $5 or 500K tokens — the gateway returns a 429 and the agent framework handles it as a graceful stop.

Why it works: This is the single most effective control against runaway loops. A stuck agent can't burn more than the session ceiling no matter how many iterations it runs.
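Gateway-side, the rolling budget is just cumulative spend keyed by session ID, checked before each request is forwarded. A minimal in-memory sketch (a production gateway would use shared storage; all names here are illustrative):

```python
from collections import defaultdict

SESSION_BUDGET_USD = 5.00
session_spend = defaultdict(float)  # session_id -> cumulative dollars


def check_session_budget(session_id, estimated_cost):
    """Called before forwarding. Returns (allowed, http_status)."""
    if session_spend[session_id] + estimated_cost > SESSION_BUDGET_USD:
        # Budget exhausted: agent frameworks treat 429 as a stop signal.
        return False, 429
    return True, 200


def record_spend(session_id, actual_cost):
    """Called after the provider responds with real token counts."""
    session_spend[session_id] += actual_cost
```

Because the check runs at the gateway, a stuck agent hits the 429 on its next iteration no matter what its own retry logic does.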

Strategy 3: Per-key monthly cap

Issue separate API keys to each team, project, or environment — and set a monthly dollar ceiling on each key. The intern experimenting with agents gets a $25/month key. The production RAG pipeline gets a $500/month key. Neither can exceed its ceiling.

Granularity matters: A single org-wide cap doesn't help you find which project overspent. Per-key caps give you both enforcement and attribution.
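A per-key cap reduces to a policy table consulted on every request. A minimal sketch using the two example budgets from the text (key names and field names are illustrative):

```python
key_policies = {
    "osah_intern_dev": {"monthly_cap_usd": 25.0,  "spent_usd": 0.0},
    "osah_rag_prod":   {"monthly_cap_usd": 500.0, "spent_usd": 0.0},
}


def within_monthly_cap(api_key, estimated_cost):
    """Reject the request if it would push the key past its monthly ceiling."""
    policy = key_policies[api_key]
    return policy["spent_usd"] + estimated_cost <= policy["monthly_cap_usd"]
```

The same table doubles as the attribution record: monthly spend per key is already broken out by team, project, or environment.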

Strategy 4: Model-tier routing rules

Define routing policies at the gateway: dev-environment keys can only access GPT-4o-mini and Claude Haiku. Production keys can access GPT-4o and Claude Sonnet. Nobody except the ML team can call Opus or o3-pro. This prevents accidental cost escalation.

Real savings: Restricting dev environments to mini-class models saves 90-95% on development-phase token costs, which typically account for 60% of total LLM spend in early-stage teams.
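Routing restrictions reduce to an allowlist per key tier, checked before a request is forwarded. A sketch of the tiers described above (the tier-to-model mapping is illustrative):

```python
ALLOWED_MODELS = {
    "dev":  {"gpt-4o-mini", "claude-haiku"},
    "prod": {"gpt-4o-mini", "claude-haiku", "gpt-4o", "claude-sonnet"},
    "ml":   {"gpt-4o-mini", "claude-haiku", "gpt-4o", "claude-sonnet",
             "claude-opus", "o3-pro"},
}


def model_allowed(key_tier, model):
    """Gateway check: may a key of this tier call this model?"""
    return model in ALLOWED_MODELS.get(key_tier, set())
```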

Strategy 5: Circuit breaker (rate-of-spend detection)

Monitor the rate of spending, not just the cumulative total. If a key's spend rate exceeds 3x its trailing 7-day average in a 15-minute window, auto-throttle it to 1 request/second and alert the key owner. This catches runaway loops before they hit the monthly cap.

Why a monthly cap isn't enough: A $500/month cap still allows burning $500 in 20 minutes. The circuit breaker catches the anomalous pattern and gives you time to investigate before the cap is exhausted.
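The trip condition compares the recent window's dollar rate against the trailing baseline rate. A simplified sketch (the 3x multiplier and 15-minute window come from the text; metric storage and sliding-window bookkeeping are elided):

```python
def breaker_tripped(spend_last_15min, trailing_7day_spend,
                    window_minutes=15, multiplier=3.0):
    """True if the recent spend rate exceeds multiplier x the
    trailing 7-day average rate, both expressed in dollars/minute."""
    minutes_per_week = 7 * 24 * 60
    baseline_rate = trailing_7day_spend / minutes_per_week
    window_rate = spend_last_15min / window_minutes
    return window_rate > multiplier * baseline_rate

# A key that spent $70 over the past week averages ~$0.007/min;
# $5 in 15 minutes (~$0.33/min) trips the breaker immediately.
```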

Implementing budget enforcement at the gateway

The key insight: budget enforcement must live outside the agent code. If the agent checks its own budget, a buggy agent can skip the check. If a gateway enforces the budget before forwarding the request, the agent literally cannot make an LLM call that violates the policy.

Python — agent code is budget-unaware, gateway handles enforcement
from openai import OpenAI

# The agent has no budget logic — it just calls the API
client = OpenAI(
    api_key="osah_project_key_dev",  # This key has:
    #   - Wallet balance acts as a spending ceiling
    #   - Per-project max_tokens_per_request policy
    #   - Premium models blocked for managed-tier keys
    base_url="https://api.opensourceaihub.ai/v1",
)

# If the agent exceeds the wallet balance or token policy,
# the gateway returns HTTP 402 or 429 before forwarding:
# { "error": "budget_exceeded" }
#
# Standard agent frameworks handle 402/429 as a stop signal.

def agent_step(messages):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

This pattern works with any agent framework (LangChain, LangGraph, CrewAI, AutoGen, custom loops) because all of them use an HTTP client to call the LLM provider. Point base_url and the project API key at the gateway; the framework needs no in-app budget logic.

The real cost of an unbounded agent loop

Let's do the math on a common scenario: a coding agent that uses GPT-4o for multi-step code generation and review.

| Scenario | Iterations | Avg context | Output/iter | Total cost |
| --- | --- | --- | --- | --- |
| Happy path (succeeds in 5 steps) | 5 | 12K tokens | 2K tokens | $0.25 |
| Moderate loop (stuck for 30 steps) | 30 | 40K tokens | 4K tokens | $4.20 |
| Runaway loop (100 iterations overnight) | 100 | 80K tokens | 4K tokens | $24.00 |
| Runaway loop with Opus/o3-pro | 100 | 80K tokens | 4K tokens | $240+ |
| 50 concurrent agents, all runaway | 100 ea | 80K tokens | 4K tokens | $1,200+ |

The jump from “happy path” to “50 concurrent runaways” is nearly four orders of magnitude. And this is one night. Over a month, unbounded agent usage can quietly accumulate five-figure bills before anyone notices, because LLM providers bill monthly and don't alert by default.
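The GPT-4o rows above can be reproduced from the rates quoted earlier ($2.50/$10 per MTok); a quick check of the arithmetic (the function name is ours):

```python
def loop_cost(iterations, avg_context, output_per_iter,
              input_per_mtok=2.50, output_per_mtok=10.00):
    """Total dollar cost of an agent loop at a fixed average context size."""
    input_cost = iterations * avg_context * input_per_mtok / 1_000_000
    output_cost = iterations * output_per_iter * output_per_mtok / 1_000_000
    return input_cost + output_cost

print(round(loop_cost(5, 12_000, 2_000), 2))    # happy path → 0.25
print(round(loop_cost(30, 40_000, 4_000), 2))   # moderate loop → 4.2
print(round(loop_cost(100, 80_000, 4_000), 2))  # runaway loop → 24.0
```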

Why OpenAI's built-in limits aren't enough

OpenAI offers account-level monthly caps and rate limits. They help, but they're too coarse for agent workloads:

  • One cap for the whole org — you can't give different teams different budgets.
  • No per-session tracking — there's no concept of an “agent session” at the API level.
  • No model-tier restrictions per key — every key can call every model.
  • No rate-of-spend alerting — you find out at end of month.
  • Only works for OpenAI — agents that call multiple providers need cross-provider budget tracking.

The same limitations apply to Anthropic, Google, and every other provider. Budget enforcement is not the LLM provider's job — it's the gateway's job.

Budget enforcement checklist for AI teams

  • Every API key has a monthly dollar cap
  • Dev keys are restricted to mini/haiku-class models only
  • Agent sessions have a per-session token budget (200K-500K tokens typical)
  • Per-request output token ceiling is set (4K-8K typical)
  • Circuit breaker fires when spend rate exceeds 3x trailing average
  • Budget alerts go to Slack/email at 50%, 80%, and 95% of cap
  • Budget enforcement is at the gateway, not in the agent code
  • Cross-provider spend is tracked in a single dashboard

Frequently asked questions

What is an LLM token budget strategy for agents?

An LLM token budget strategy is a set of controls that limit how many tokens (and therefore how much money) an autonomous AI agent can consume per request, per session, or per billing period. The five main strategies are: per-request ceilings, per-session budgets, per-key monthly caps, model-tier routing restrictions, and circuit breakers for anomalous spend rates.

How do I set an LLM spending cap for AI agents?

The most reliable approach is gateway-level enforcement: route your agent's LLM calls through a proxy that tracks cumulative spend per API key and per session, and returns HTTP 429 when a budget ceiling is hit. This works with any agent framework because it operates at the HTTP layer, not inside the agent code.

Why shouldn't I put budget logic in the agent code?

Because buggy agent code can skip its own budget check. If the agent crashes, retries unexpectedly, or has a logic error, the in-code budget guard may never execute. Gateway-level enforcement is external to the agent — the agent literally cannot make an API call that exceeds the budget because the gateway blocks it before forwarding.

What is a good per-session token budget for an AI agent?

For most agent workloads, 200K-500K tokens per session is a reasonable starting point. That covers 15-30 iterations with moderate context growth. Coding agents with large codebases may need 500K-1M. Set the ceiling, monitor actual usage for a week, then adjust. The goal is to catch runaway loops without blocking legitimate deep-reasoning sessions.

Does OpenAI have per-project budget limits?

OpenAI offers account-level monthly usage caps and project-level rate limits. However, they don't provide per-session budgets, per-key dollar caps, model-tier restrictions per key, or rate-of-spend anomaly detection. For fine-grained budget enforcement across projects and agent sessions, you need an LLM gateway with built-in budget management.

How much can a runaway AI agent cost overnight?

A single runaway agent loop running GPT-4o at 80K context per iteration for 100 iterations costs approximately $24. With a premium model like Claude Opus, that becomes $240+. Running 50 concurrent runaway agents — common in batch-processing setups — can accumulate $1,200+ in a single night. Budget enforcement and circuit breakers prevent this.

Ship agents without the bill shock

Per-request token ceilings, wallet-based spending limits, and model-tier restrictions — enforced at the gateway. Integration is two lines: set base_url and your project api_key. Route 300+ models across 9+ providers from one endpoint. Free tier includes 1 million Hub Credits.

The same gateway adds PII redaction (28+ entity types), per-project DLP with configurable BLOCK or REDACT policies, vision/OCR scanning on image payloads, and prompt-injection detection. Traffic is processed statelessly with metadata-only logging; per-project audit logs support compliance. Prefer your own contracts? BYOK is supported.
