LLM Budget Enforcement & AI Cost Control
Unpredictable AI billing is the #1 barrier to enterprise LLM adoption. OpenSourceAIHub enforces cost limits at the gateway level, before requests ever reach providers, with hard quotas, threshold alerts, and automatic runaway-loop protection.
In Development: Early Access Program
Budget Enforcement & Recursive Loop Protection is currently in active development.
Want to help shape the feature or get early access for your team? Join the Phase 1.1 waitlist and we'll notify you as soon as the beta is live. Early access teams will get direct input into the alert thresholds, policy configuration, and dashboard design.
Wallet-Based Budget Enforcement
OpenSourceAIHub already enforces cost limits on every Managed Mode request today. This is the foundation that the Phase 1.1 features (threshold alerts, loop protection, Budget Mode) will build upon. Here's how the current system works:
Pre-Flight Balance Enforcement (Active)
Estimate input cost: The Hub tokenizes the full messages array (the entire conversation history, not just the latest message) and multiplies by the model's per-token input rate.
Check wallet balance: If the estimated cost exceeds the available wallet balance, the request is immediately rejected with a 402 Insufficient Balance error. The provider never sees the request.
Cap output tokens: After reserving the input cost, the Hub calculates the maximum output tokens the remaining balance can cover. If your max_tokens exceeds this, the Hub silently caps it, preventing surprise overspend while still returning a useful response. A minimal sketch of this logic follows.
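Here is a minimal TypeScript sketch of the pre-flight logic above. It is illustrative only: countTokens and the rate fields are hypothetical stand-ins, not the Hub's actual internals.
// Illustrative pre-flight balance check (not the Hub's actual code).
interface ModelRates {
  inputPerToken: number;  // $ per input token
  outputPerToken: number; // $ per output token
}

function preflightCheck(
  messages: { role: string; content: string }[],
  maxTokens: number,
  walletBalance: number, // $ available
  rates: ModelRates,
  countTokens: (text: string) => number // hypothetical tokenizer
): { ok: boolean; cappedMaxTokens?: number; error?: string } {
  // 1. Estimate input cost over the FULL conversation history.
  const inputTokens = messages.reduce((sum, m) => sum + countTokens(m.content), 0);
  const inputCost = inputTokens * rates.inputPerToken;

  // 2. Reject before the provider ever sees the request.
  if (inputCost > walletBalance) {
    return { ok: false, error: "402 Insufficient Balance" };
  }

  // 3. Cap max_tokens to what the remaining balance can cover.
  const remaining = walletBalance - inputCost;
  const affordableOutput = Math.floor(remaining / rates.outputPerToken);
  return { ok: true, cappedMaxTokens: Math.min(maxTokens, affordableOutput) };
}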
Atomic Deductions
Wallet deductions use atomic, transaction-safe operations with distributed locking. You can never be double-charged or spend below zero, even under high concurrency.
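One common way to implement that guarantee is an atomic check-and-decrement executed inside the datastore itself. Below is a minimal sketch using a Redis Lua script via ioredis; the key layout and script are illustrative assumptions, not the Hub's actual implementation.
import Redis from "ioredis";

const redis = new Redis();

// The Lua script runs atomically inside Redis: the balance check and the
// decrement happen as one step, so concurrent requests can never
// double-charge a wallet or drive it below zero.
const DEDUCT_SCRIPT = `
  local balance = tonumber(redis.call('GET', KEYS[1]) or '0')
  local amount = tonumber(ARGV[1])
  if balance < amount then return -1 end
  return redis.call('DECRBY', KEYS[1], amount)
`;

async function deductCredits(walletKey: string, credits: number): Promise<boolean> {
  const result = await redis.eval(DEDUCT_SCRIPT, 1, walletKey, credits);
  return result !== -1; // -1 means insufficient balance; nothing was deducted
}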
Smart Router Cost Optimization
The Hub automatically selects the cheapest available provider for each model family. If Groq serves Llama 3 cheaper than Together.ai today, your request goes to Groq.
{
"error": {
"message": "Insufficient wallet balance. Please top up.",
"type": "insufficient_balance",
"code": 402,
"required_balance": "0.004200",
"current_balance": "0.001000",
"correlation_id": "req_e5f6g7h8"
}
}
The response includes both required_balance and current_balance so your application can display a precise top-up prompt or automatically switch to a cheaper model.
Key guarantee: you will never be charged more than your wallet balance. The check happens before the request is forwarded, not after.
What Phase 1.1 Adds
The wallet system above protects individual requests. Phase 1.1 adds project-level governance on top of it:
Why LLM Costs Spiral Quickly
LLM APIs bill per token (roughly ¾ of a word). Costs are quoted per million tokens, which sounds cheap until you multiply by the number of developers, requests per day, and average conversation length:
Cost Escalation Example
The output-rate difference between models can be 20×: oah/gpt-4.1 costs $8.00 per million output tokens versus $0.40 for oah/llama-3-70b (see the pricing table below). Without governance, teams default to the most expensive model. Budget enforcement creates natural pressure to use cost-efficient alternatives.
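A back-of-envelope illustration in TypeScript. The team size and traffic numbers are hypothetical; the rates come from the Managed Mode pricing table later on this page.
// Hypothetical team: 10 developers, 500 requests/day each, 30 days,
// 2,000 input + 500 output tokens per request.
const requestsPerMonth = 10 * 500 * 30; // 150,000 requests

// Rates in $ per 1M tokens, from the pricing table below.
const gpt41 = { input: 2.0, output: 8.0 };   // oah/gpt-4.1
const llama3 = { input: 0.35, output: 0.4 }; // oah/llama-3-70b

const monthlyCost = (r: { input: number; output: number }) =>
  (requestsPerMonth * (2_000 * r.input + 500 * r.output)) / 1_000_000;

console.log(monthlyCost(gpt41));  // 1200 -> $1,200/month
console.log(monthlyCost(llama3)); // 135  -> $135/month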
The Four Root Causes of AI Spend Spirals
- Token pricing is opaque: Each model has different input/output rates. Teams rarely know what a single request actually costs until the invoice arrives.
- Shared API keys: When 10 developers share one provider key, there's no way to attribute cost to a team, project, or feature.
- Runaway agents & loops: Retry loops, recursive AI agents, and unbounded batch pipelines can burn millions of tokens in minutes before anyone notices.
- No built-in audit trail: Most providers log total usage, not per-request cost. Debugging a $500 spike requires guesswork.
The Need for Budget Enforcement
Without a cost-governance layer, most organizations discover the same painful pattern:
No Spending Ceiling
OpenAI, Anthropic, and Google charge per token with no hard stop. A billing alert after the fact doesn't prevent the spend.
No Per-Project Quotas
A single API key serves 5 teams. When the bill spikes, nobody knows which project caused it, and nobody owns the fix.
No Audit Logs
Provider dashboards show aggregate usage. They can't tell you which request cost $2.40 or which engineer triggered the batch job.
The solution is a budget enforcement engine that sits between your application and the LLM provider, checking every request against project-level quotas before it's forwarded. This is exactly what OpenSourceAIHub provides, with three layers of protection that work together.
Budget Enforcement Architecture
Every request flows through a multi-stage pipeline before reaching the LLM provider. The budget engine adds enforcement checks at the gateway level, with no provider-side configuration needed (a simplified sketch follows the steps below):
Budget checks happen before the request is forwarded. The provider never sees a request that exceeds budget.
Project quota check: The engine looks up the project's monthly budget, current spend, and per-request token caps. If the monthly budget is exhausted, the request is rejected immediately.
Loop detection: A sliding-window analyzer checks whether this key has sent repeated or near-identical prompts in the last 60 seconds. If an agent loop is detected, the request is killed.
Pre-flight cost estimate: The engine tokenizes the full messages array and estimates the worst-case cost (input + maximum output). This estimate is checked against the remaining budget.
DLP + forward: If the budget check passes, the request flows through the PII redaction engine and is forwarded to the optimal provider.
Post-inference deduction: After the provider responds, the actual token usage is calculated and deducted atomically from the project's budget. Every deduction is logged with full attribution.
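A highly simplified sketch of the stage ordering, with in-memory placeholders for the stores and helpers. This illustrates the short-circuit behavior only; it is not the Hub's internal code.
// Illustrative pipeline; each stage can block the request before forwarding.
type Stage = (ctx: RequestContext) => Promise<void>;

interface RequestContext {
  projectBudgetRemaining: number;
  estimatedWorstCaseCost: number;
  recentRequestCount: number; // requests from this key in the last 60s
  blocked?: { status: number; code: string };
}

const quotaCheck: Stage = async (ctx) => {
  if (ctx.projectBudgetRemaining <= 0) ctx.blocked = { status: 402, code: "budget_exhausted" };
};

const loopCheck: Stage = async (ctx) => {
  if (ctx.recentRequestCount > 50) ctx.blocked = { status: 429, code: "agent_loop_detected" };
};

const preflightEstimate: Stage = async (ctx) => {
  if (ctx.estimatedWorstCaseCost > ctx.projectBudgetRemaining)
    ctx.blocked = { status: 402, code: "insufficient_budget" };
};

// Stages run in order; the first one that blocks stops the pipeline
// before the request is ever forwarded to a provider.
async function runPipeline(ctx: RequestContext, stages: Stage[]) {
  for (const stage of stages) {
    await stage(ctx);
    if (ctx.blocked) return ctx.blocked;
  }
  return null; // proceed to DLP + forwarding, then post-inference deduction
}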
The 3 Layers of AI Cost Control
Effective budget enforcement isn't a single switch; it's a layered defense. Each layer catches a different class of spend risk, from planned overages to catastrophic agent loops.
Hard Quotas
Prevent requests once a limit is reached
Set a monthly dollar or credit ceiling for each project. When the project hits 100% of its budget, the gateway returns HTTP 402 for all subsequent requests; the LLM provider never sees them.
{
"error": {
"message": "Project budget exhausted. Monthly limit: 100,000,000 credits. Used: 100,000,000 credits.",
"type": "budget_exhausted",
"code": 402,
"project_id": "proj_abc123",
"budget_limit": 100000000,
"budget_used": 100000000,
"resets_at": "2026-04-01T00:00:00Z",
"correlation_id": "req_x9y8z7"
}
}
What Hard Quotas Enforce
- Monthly credit/dollar ceiling per project
- Per-request maximum token cap (e.g., 4,096 output tokens)
- Pre-flight balance check: rejected before the provider sees the request
- Atomic, transaction-safe deductions with distributed locking
Soft Alerts & Threshold Notifications
Notify teams before they hit the wall
Hard stops are the last resort. Threshold alerts give engineering leads and FinOps teams early warning so they can investigate and adjust before requests start failing.
| Threshold | Severity | Action |
|---|---|---|
| 50% | Info | Dashboard banner + optional Webhook/Slack notification |
| 80% | Warning | Alert to project owner + optional "Budget Mode" activation |
| 100% | Critical | Block requests until next cycle or manual override |
Delivery channels: Webhook URL (custom endpoint), Slack (incoming webhook), Dashboard (banner + notification bell). Thresholds are configurable per-project.
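For teams consuming alerts over a webhook, a minimal receiver might look like the sketch below. The payload shape (project_id, threshold, severity, budget fields) is an assumption for illustration; consult the docs for the Hub's actual schema once the feature ships.
import express from "express";

// Hypothetical alert payload; field names are illustrative assumptions.
interface BudgetAlert {
  project_id: string;
  threshold: number; // 0.5, 0.8, or 1.0
  severity: "info" | "warning" | "critical";
  budget_used: number;
  budget_limit: number;
}

const app = express();
app.use(express.json());

app.post("/hooks/budget-alert", (req, res) => {
  const alert = req.body as BudgetAlert;
  if (alert.severity === "critical") {
    // 100% threshold: page the on-call, pause batch jobs, etc.
    console.error(`Budget exhausted for ${alert.project_id}`);
  } else {
    console.warn(
      `${alert.project_id} at ${alert.threshold * 100}% of budget ` +
      `(${alert.budget_used}/${alert.budget_limit} credits)`
    );
  }
  res.sendStatus(204);
});

app.listen(3000);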
Recursive Loop Protection
Identify and kill runaway agents
AI agents that call themselves in loops are the most dangerous cost threat. A single misconfigured ReAct agent can drain an entire wallet in seconds, sending the same prompt hundreds of times before any human notices.
Three Detection Mechanisms
Sliding-window rate detection
If a project key sends > N requests within M seconds with identical or near-identical prompts, the proxy returns HTTP 429 with an agent_loop_detected error.
Token velocity throttling
If token consumption for a single key exceeds a configurable tokens-per-minute ceiling, subsequent requests are throttled or blocked.
Prompt fingerprint deduplication
Hash-based detection of repeated prompt payloads within a short window. Even slight variations in phrasing are caught by fuzzy matching. A simplified sketch of these checks follows.
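Below is a simplified sketch combining the first and third mechanisms: a per-key sliding window over normalized prompt fingerprints. The in-memory Map and the whitespace/case normalization are illustrative; real fuzzy matching and distributed storage would be more involved. The threshold of 50 requests per 60 seconds mirrors the example policy later on this page.
import { createHash } from "node:crypto";

const WINDOW_MS = 60_000; // 60-second sliding window
const MAX_REPEATS = 50;   // mirrors max_requests_per_window in the policy example

// Per-key history of (timestamp, prompt fingerprint) pairs.
const history = new Map<string, { ts: number; fp: string }[]>();

function fingerprint(messages: { content: string }[]): string {
  // Normalize whitespace and case so trivially varied prompts still collide.
  const text = messages.map((m) => m.content.toLowerCase().replace(/\s+/g, " ")).join("\n");
  return createHash("sha256").update(text).digest("hex");
}

function isAgentLoop(apiKey: string, messages: { content: string }[]): boolean {
  const now = Date.now();
  const fp = fingerprint(messages);
  // Drop entries that fell out of the window, then record this request.
  const entries = (history.get(apiKey) ?? []).filter((e) => now - e.ts < WINDOW_MS);
  entries.push({ ts: now, fp });
  history.set(apiKey, entries);
  // How many requests in the window share this fingerprint?
  const repeats = entries.filter((e) => e.fp === fp).length;
  return repeats >= MAX_REPEATS; // request #50 of an identical prompt trips the detector
}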
Real-world scenario: A customer support bot using a ReAct framework enters a tool-call loop, calling the same API 200 times in 12 seconds. Without loop protection, this burns ~400K tokens ($3.20 at GPT-4.1 output rates) in under a minute. With loop protection, the Hub kills the loop at request #50 and returns a clear error to the application.
Example Budget Policy
Each project can have its own budget configuration that combines all three layers. Here's a complete example for a customer support bot:
{
"project_id": "proj_customer_support",
"budget": {
"monthly_limit_credits": 100000000,
"per_request_max_tokens": 4096,
"alert_thresholds": [0.50, 0.80, 1.00],
"on_exhausted": "block",
"budget_mode": {
"enabled": true,
"trigger_threshold": 0.80,
"allowed_models": [
"oah/llama-3-70b",
"oah/gpt-4.1-mini",
"oah/gemini-2.5-flash"
]
},
"loop_protection": {
"max_requests_per_window": 50,
"window_seconds": 60,
"max_tokens_per_minute": 500000
},
"notifications": {
"webhook_url": "https://hooks.slack.com/services/...",
"channels": ["webhook", "dashboard"]
}
}
}
| Setting | Value | Notes |
|---|---|---|
| Monthly Limit | 100M credits | ~$100/month |
| Max per Request | 4,096 tokens | Output cap enforcement |
| Loop Threshold | 50 req / 60s | Per-key sliding window |
Real-Time Cost Tracking
Budget enforcement is only half the picture. You also need visibility into where the money is going. The Hub dashboard provides real-time cost analytics per project:
Token Usage by Model
See exactly which models consume the most tokens. Identify whether GPT-4.1 is being used for tasks that GPT-4.1-mini could handle.
Per-Model Cost Breakdown
Dollar and credit cost per model family, updated in real time after every request. Spot the most expensive models at a glance.
Provider Comparison
Compare cost-per-request across providers for the same model. See if Groq is serving Llama 3 cheaper than Together.ai today.
Daily Burn Rate
Trend chart extrapolating when the project's budget will be exhausted at current velocity, so you can intervene before it happens. The underlying arithmetic is sketched below.
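The extrapolation itself is simple arithmetic; here is a sketch with hypothetical inputs.
// Days until exhaustion at the current burn rate.
function daysUntilExhausted(budgetLimit: number, budgetUsed: number, dailyBurn: number): number {
  if (dailyBurn <= 0) return Infinity;
  return (budgetLimit - budgetUsed) / dailyBurn;
}

// e.g., a 100M-credit budget with 60M used, burning 5M credits/day:
console.log(daysUntilExhausted(100_000_000, 60_000_000, 5_000_000)); // 8 days left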
Cost Visibility Response Headers
Every successful response includes headers for programmatic cost tracking:
| Header | Description |
|---|---|
| x-hub-scan-ms | DLP/firewall scanning time (milliseconds) |
| x-hub-violations | Detected PII entity types (e.g., EMAIL_ADDRESS, US_SSN) |
| x-hub-correlation-id | Unique request ID for debugging and audit trail |
| x-hub-model | The actual model used for this request |
| x-hub-provider | The provider that served the request |
| x-hub-budget-remaining | Remaining project budget credits after this request |
| x-hub-budget-mode | Set to "active" when the project is in cost-saving Budget Mode |
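One way to read these headers programmatically is a plain fetch call against the proxy endpoint. This is a sketch; the key and model follow the examples elsewhere on this page.
async function checkBudgetHeaders() {
  const res = await fetch("https://api.opensourceaihub.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: "Bearer oah_proj_support_xxxxx",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "oah/gpt-4.1-mini",
      messages: [{ role: "user", content: "Hello" }],
      max_tokens: 128,
    }),
  });

  console.log("Budget remaining:", res.headers.get("x-hub-budget-remaining"));
  if (res.headers.get("x-hub-budget-mode") === "active") {
    // Project is in cost-saving Budget Mode; adapt the UI accordingly.
    console.log("Using cost-efficient model");
  }
}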
Preventing Runaway Prompts & Agent Loops
The most expensive AI incidents aren't caused by high usage; they're caused by accidents. Here are the three most common patterns the Hub detects and stops automatically:
Infinite Agent Loops
A ReAct or LangChain agent enters a tool-call cycle, calling the same function and LLM endpoint hundreds of times. Without intervention, a single loop can generate 1M+ tokens in minutes.
Hub protection: The Hub's sliding-window detector identifies the repeated pattern and returns HTTP 429 before the loop drains the budget.
Unbounded Batch Jobs
A script iterates over 10,000 database rows, sending each to GPT-4.1 for classification. The developer intended to test with 10 rows but forgot to add a LIMIT clause.
Hub protection: Per-minute token velocity limits catch the abnormal throughput and throttle the requests, giving the developer time to notice and cancel.
Conversation History Explosion
A chatbot sends the full conversation history on every request. After 50 messages, each request costs 10x the first one, and the user doesn't realize it.
Hub protection: The pre-flight cost estimate flags requests where input tokens alone would consume a disproportionate share of the remaining budget.
The Future: Graceful Downgrade
Policy-Based βBudget Modeβ Routing
Hard-blocking at 100% keeps costs controlled, but it also breaks your application. Our upcoming Graceful Downgrade feature takes a smarter approach: instead of killing requests, the Hub automatically switches to cheaper models to keep your app alive on a tighter budget.
When Budget Mode activates, responses include the X-Hub-Budget-Mode: active header so your application can adapt its UI (e.g., show a "Using cost-efficient model" badge). Graceful Downgrade is part of Phase 1.1 of our roadmap. Join the early access program above to help shape the implementation.
Per-Million-Token Pricing (Managed Mode)
Understanding per-model costs is critical for setting realistic budgets. Here are the current rates in Managed Mode:
| Model | Input / 1M | Output / 1M |
|---|---|---|
| oah/llama-3-70b | $0.35 | $0.40 |
| oah/gpt-4.1 | $2.00 | $8.00 |
| oah/gpt-4.1-mini | $0.40 | $1.60 |
| oah/claude-sonnet-4.6 | $3.00 | $15.00 |
| oah/gemini-2.5-flash | $0.30 | $2.50 |
| oah/deepseek-r1 | $0.50 | $2.15 |
| oah/mixtral-8x7b | $0.45 | $0.70 |
| oah/grok-3-mini | $0.30 | $0.50 |
Managed Mode applies a 25% markup for open-weight models and 30% for closed-source models over wholesale provider cost. See the full model catalog for 100+ models.
Using OpenSourceAIHub for Cost Control
The Hub offers two billing modes. Both include the full AI Firewall (DLP) security layer at no additional cost.
Managed Mode (Wallet)
- Pre-pay credits via Stripe (minimum $5 / 5M credits)
- $1.00 = 1,000,000 Hub Credits
- Pre-flight balance check on every request
- 402 rejection when balance is insufficient
- Automatic max_tokens capping based on remaining balance
- Smart Router selects cheapest available provider
- 25% markup (open-weight) / 30% markup (closed-source)
BYOK Mode (Bring Your Own Key)
- Store your own provider API keys (AES-256-GCM encrypted)
- Zero Hub markup: provider bills you directly
- Full DLP / AI Firewall protection included free
- Budget enforcement via provider-side limits
- Supports all 9 providers and 100+ models
- Mix with Managed Mode per-provider (hybrid)
- Project-scoped keys for team isolation
Hybrid Mode
You can use BYOK for some providers and Managed Mode for others, simultaneously. If you have a Groq API key stored but no Together.ai key, requests routed to Groq use your key (zero Hub cost) while Together.ai requests are deducted from your wallet. The Hub resolves the billing mode automatically per request. Learn more in the OpenAI-compatible proxy guide.
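Conceptually, the per-request billing resolution reduces to a lookup like the sketch below; the names and store shape are hypothetical, not the Hub's actual code.
type BillingMode = { mode: "byok"; providerKey: string } | { mode: "managed" };

// Hypothetical per-project store of BYOK keys, keyed by provider name.
function resolveBilling(byokKeys: Map<string, string>, provider: string): BillingMode {
  const key = byokKeys.get(provider);
  if (key) return { mode: "byok", providerKey: key }; // provider bills you directly
  return { mode: "managed" };                         // deducted from your wallet
}

// Groq key stored, Together.ai not:
const keys = new Map([["groq", "gsk_..."]]);
console.log(resolveBilling(keys, "groq"));     // { mode: "byok", ... }  zero Hub cost
console.log(resolveBilling(keys, "together")); // { mode: "managed" }    wallet deduction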
Project-Level Cost Isolation
OpenSourceAIHub supports project-scoped API keys (oah_*) that separate billing and policies by team, application, or environment:
Per-project DLP policies
Each project gets its own DLP policy β a healthcare project blocks SSNs while marketing only logs emails.
Per-project provider keys
Store different BYOK keys per project. Production uses your enterprise OpenAI key; staging uses a shared Groq key.
Per-project usage tracking
Every request is tagged with the project ID. The dashboard shows token usage, cost, and violations per project.
Per-project budget quotas
Set independent monthly budgets for each project. The customer support bot gets $100/month; the internal tool gets $50.
import OpenAI from "openai";
// Project key (oah_*), scoped to "Customer Support Bot"
const client = new OpenAI({
apiKey: "oah_proj_support_xxxxx",
baseURL: "https://api.opensourceaihub.ai/v1",
});
// This request inherits:
// - The project's monthly budget ($100 / 100M credits)
// - The project's DLP policy (BLOCK SSN, REDACT EMAIL)
// - The project's loop protection settings
// - The project's BYOK keys (if configured)
const response = await client.chat.completions.create({
model: "oah/gpt-4.1-mini",
messages: [{ role: "user", content: "Summarize this patient record..." }],
max_tokens: 512,
});
Handling Budget Errors in Application Code
When the pre-flight check determines the budget can't cover the request, the Hub returns a structured JSON error. Your application should handle this gracefully:
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "os_hub_your_key_here",
baseURL: "https://api.opensourceaihub.ai/v1",
});
try {
const response = await client.chat.completions.create({
model: "oah/gpt-4.1",
messages: [{ role: "user", content: prompt }],
});
console.log(response.choices[0].message.content);
} catch (err: any) {
if (err.status === 402) {
const detail = err.error;
console.error(
`Budget exceeded: need $${detail.required_balance}, ` +
`have $${detail.current_balance}`
);
// Option A: Retry with a cheaper model (oah/llama-3-70b)
// Option B: Prompt user to top up wallet
// Option C: Queue for later processing
} else if (err.status === 429) {
console.error("Rate limited β possible agent loop detected");
// Back off and investigate
} else {
throw err;
}
}
from openai import OpenAI, APIStatusError
client = OpenAI(
api_key="os_hub_your_key_here",
base_url="https://api.opensourceaihub.ai/v1",
)
try:
response = client.chat.completions.create(
model="oah/gpt-4.1",
messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
except APIStatusError as e:
if e.status_code == 402:
print(f"Budget exceeded: {e.body}")
# Fallback to cheaper model or notify user
elif e.status_code == 429:
print("Rate limited β possible agent loop detected")
else:
raise
Cost Optimization Strategies
Use the Smart Router
In Managed Mode, the Smart Router automatically selects the cheapest available provider for each model family. If Groq serves Llama 3 70B cheaper than Together.ai today, your request goes to Groq, with no code changes needed.
Right-size your models
GPT-4.1-mini ($0.40/1M input) handles most extraction tasks as well as GPT-4.1 ($2.00/1M input). Use the pricing table to match model capability to task complexity.
Set max_tokens explicitly
Always set max_tokens in your requests. Without it, models may generate thousands of tokens for a one-sentence answer. The Hub caps this automatically in Managed Mode, but explicit limits give tighter control.
Truncate conversation history
LLMs re-process the full messages array every request. After 50 messages, each request sends all 50. Trim old messages to keep input tokens (and costs) under control.
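A simple truncation helper along these lines; the keep-count is an arbitrary illustration, and production systems often summarize older turns instead of dropping them.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt plus only the last N conversational messages.
function truncateHistory(messages: Message[], keepLast = 10): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}

// A 50-message conversation now sends at most 1 system + 10 recent messages.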
Monitor per-project dashboards
Use project-scoped keys and the Hub dashboard to track which models consume the most tokens. Identify outliers before they become budget crises.
Enable Budget Mode at 80%
Configure Graceful Downgrade so your app stays alive on cheaper models instead of hard-blocking when the budget gets tight. Your users see slower responses, not errors.
Start Enforcing LLM Budgets
Sign up, top up your wallet with $5, and every request is automatically budget-checked from your very first API call. No configuration required. Existing wallet enforcement is live today; advanced features (threshold alerts, loop protection, Budget Mode) are coming in Phase 1.1.
Related Documentation
- AI Gateway with PII Redaction: Protect sensitive data before it reaches LLMs
- OpenAI-Compatible Proxy: Drop-in replacement for the OpenAI SDK
- OpenRouter Alternative: AI Gateway with built-in governance
- Vercel AI Gateway Alternative: Security-first AI routing
- Quickstart: Connect your first application in 2 minutes
- Billing & Wallet Docs: Credit system, top-ups, and deduction mechanics
- Model Catalog: Pricing across 100+ models and 9 providers
- Enterprise Security & Trust Center
- Product Roadmap: Phase 1.1 Budget Enforcement timeline