
FinOps for Agents: A Practitioner Playbook for Routing, Gateways, and Prompt Caching

Published: June 2026 | Author: David Daniel

Teaser / companion to the forthcoming paper "Proving the Loop Paid Off: Measuring and Governing the Value of Agentic AI Spend." This article covers the near-term cost-control layer: the operational levers a platform team can pull this quarter. The fuller paper takes up the harder question it sets up: once you have spend under control, how do you measure whether it created value?

The Bill Became the Conversation

A vendor crossing $300 million in annual recurring revenue is not, by itself, news. A vendor crossing $300 million while making cuts to your AI budget a major selling point is. That is where Glean is in mid-2026, according to TechCrunch's May 28 reporting: the company has leaned publicly into the pitch that customers are blowing through their AI budgets and that cutting that token spend is a service they will pay for. Two caveats up front, because this article holds itself to public sources: the ARR figure and the budget-cutting pitch are the company's public positioning as reported by the press, not an audited financial analysis, and one data point is not a market. But when a fast-growing enterprise AI vendor decides the winning sales angle is "we'll help you spend less on AI," that tells you something about what its customers are asking for.

Here is the structural read, and I want to be explicit that this is the article's analytical framing, not a reported statistic. Chat-style AI assistance has a natural cost governor: a human can only type so fast. An autonomous agent has no such governor. A long-running agent loop re-reads its instructions, its tool definitions, its retrieved context, and its accumulated working state on every turn, for as many turns as the task takes, with nobody watching the meter. Spend stops scaling with the number of people you've licensed and starts scaling with how much autonomy you've granted. That is a variable cost with a shape that seat-based budgeting was never designed to model, and it is why "the AI line item is up again" lands on platform teams as an operational emergency rather than a procurement question.

The instinctive response is to reach for the hard question first: is this spend worth it? That question (value attribution for agentic work, why single-metric ROI misleads, how to build measurement you can trust) is genuinely hard, genuinely contested, and the subject of the companion paper. But there is a more immediate move available, and it does not require resolving the ROI debate first: make the spend observable and controllable. That is FinOps applied to agentic workloads. The term is borrowed deliberately: FinOps is the established discipline of operationalizing cloud cost management, and extending it to agents is an emerging framing that others in the industry are converging on too. InfoWorld and FinOps-tooling vendors such as Finout have both published under the same banner; this article doesn't claim the coinage, only the playbook. The argument here is narrow and practical: the near-term defense against runaway agent spend is operational, not architectural. You do not need to re-platform. You need four levers, all of them public, documented, and shipping today, starting with the bluntest one.

Lever 0. Spend Caps: Turn On the Brakes Before You Tune the Engine

Before the cleverer levers, the blunt one: hard budget caps, set at the platform level, switched on first. This is lever zero because it is the only control that works before you have telemetry (a cap does not need to know which agent burned the tokens in order to stop the burn) and because the platforms are now shipping it natively. GitHub's April 27, 2026 announcement moving Copilot to usage-based billing ships the controls in the same release as the meter: "Admins will also have new budget controls. They will be able to set budgets at the enterprise, cost center, and user levels. When the included pool is exhausted, organizations can choose whether to allow additional usage at published rates or cap spend." That is the vendor's own announcement describing its own feature, and should be weighed as such, but the practitioner move does not depend on taking the framing at face value: if a platform you already pay for exposes a budget object, create it, set it below the number that would trigger an emergency meeting, and only then start optimizing.

The case for doing this now rather than next quarter is that uncapped meters bite fast, at every company size. At the large end, Uber reportedly burned through its entire planned 2026 budget for agentic AI coding tools (Claude Code and Cursor among them) within the first four months of the year, and responded with a hard cap of $1,500 per employee per month per tool. The attribution chain matters, so here it is in full: the budget exhaustion is what Uber's CTO, Praveen Neppalli Naga, told The Information in April; the $1,500 cap comes from Bloomberg's reporting; and both reach this article via Yahoo Finance's syndicated write-up. Every figure in this paragraph is second-hand press reporting and should be read as reported, not audited. At the individual-developer end, the first week of GitHub's metered billing produced exactly the failure mode caps exist for: Visual Studio Magazine's June 4 piece "Copilot Billing Shock Hits Developers" documents users exhausting most of a month's credit allowance within days of token billing taking effect June 1. Neither story is an argument against usage-based pricing. Both are evidence (one reported, one documented in trade press) that the default state of a metered AI platform with no cap is a surprise invoice.

One adjacent lever is emerging in the same governance conversation and is worth naming with its risk attached: subscription arbitrage through open harnesses. OpenCode, an MIT-licensed agent harness (about 169,000 GitHub stars as of June 9, 2026; a volatile point-in-time count, cited for scale only), lets paid Copilot subscribers drive it with the subscription they already hold, under an official partnership. GitHub's own changelog (January 16, 2026) puts it as "no additional AI license needed." The same project sells gateway model access "at cost" and a $10/month plan on open models ($5 for the first month). Both characterizations are from OpenCode's own pages, vendor self-description as of June 2026, and "cheaper" remains community framing, not an independently verified saving. The reason this belongs under governance rather than under routing is the failure mode: the lever can be revoked from the model vendor's side. Anthropic moved in early 2026 to block Claude Pro/Max subscriptions from third-party harnesses, OpenCode included, as The Register reported on February 20, 2026. Treat arbitrage as an opportunistic discount, not a load-bearing budget line.

Lever 1. Prompt Caching: The Most Verifiable Cut Available

With the caps on, start the optimization work with prompt caching, because it is the one lever whose economics rest on an official published number rather than a vendor case study or a modeled estimate. Agentic workloads have a distinctive traffic shape: they re-send enormous, near-identical context (system prompt, tool definitions, schemas, file trees, retrieved documents) turn after turn after turn. Caching exists precisely to stop paying full price for that repetition.

Anthropic's prompt-caching documentation makes the pricing concrete: cache reads are billed at 0.1× the base input-token price (roughly ten times cheaper than sending the same tokens fresh), while writing to the cache carries a premium: 1.25× base for a 5-minute cache lifetime, 2× for a 1-hour lifetime (multipliers current in the live docs as of June 2026; check the docs for model-specific rates before budgeting against them).

The break-even arithmetic falls straight out of those multipliers: what follows is derived from the published pricing, not a separately sourced claim. Compare caching a stable prefix against re-sending it uncached on every turn. With a 5-minute cache, the first turn costs 1.25× instead of 1×, and every subsequent turn costs 0.1× instead of 1×. After a single cache hit the cached path is already cheaper (1.35× total versus 2× uncached); every turn after that widens the gap by 0.9× of the prefix cost. A 1-hour cache write, at 2×, takes one hit longer: it trails slightly after the first hit (2.1× versus 2×) and is ahead after the second (2.2× versus 3×). In other words, for any prefix that survives even three turns, the write premium is noise. A multi-turn agent re-reads its prefix dozens or hundreds of times per task, which is why, of the levers in this playbook, this is the one whose payoff you can compute on the back of an envelope before deploying anything. The real-world numbers will differ from the idealized arithmetic (caches expire, prefixes churn, not every token is cacheable), but the direction and rough magnitude are fixed by the published multipliers.
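The same arithmetic can be checked in a few lines. The sketch below is illustrative only: it prices a stable prefix in units of one uncached send, uses the multipliers quoted above (1.25× and 2× for writes, 0.1× for reads), and assumes every turn after the first is a cache hit, which real traffic will not guarantee. The function names are hypothetical.

```python
# Costs are in units of "one uncached send of the prefix" (base input price = 1.0).
# Multipliers are the ones quoted above; treat them as assumptions and re-check the docs.
CACHE_READ = 0.1
CACHE_WRITE = {"5m": 1.25, "1h": 2.0}


def cached_cost(turns: int, ttl: str = "5m") -> float:
    """Prefix cost over `turns` calls: one cache write, then a read on every later turn."""
    if turns < 1:
        return 0.0
    return CACHE_WRITE[ttl] + CACHE_READ * (turns - 1)


def uncached_cost(turns: int) -> float:
    """Prefix cost over `turns` calls when the prefix is re-sent at full price each time."""
    return 1.0 * turns


def first_cheaper_turn(ttl: str) -> int:
    """Smallest number of turns at which the cached path costs less than the uncached one."""
    turns = 1
    while cached_cost(turns, ttl) >= uncached_cost(turns):
        turns += 1
    return turns


for ttl in ("5m", "1h"):
    n = first_cheaper_turn(ttl)
    print(f"{ttl}: cheaper from turn {n} "
          f"({cached_cost(n, ttl):.2f} cached vs {uncached_cost(n):.2f} uncached)")
```

Turn 2 is the first cache hit and turn 3 the second, so the output lines up with the prose: the 5-minute cache is ahead after one hit, the 1-hour cache after two.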

The practitioner discipline that makes the lever pay is prefix stability. Caching rewards context that is identical, byte for byte, across calls. That means structuring agent prompts so the stable material (instructions, tool definitions, schemas, long reference documents) sits at the front and never varies mid-task, while the volatile material (the current turn, fresh tool output) is appended after it. Teams that interleave a timestamp, a dynamic ID, or a re-ordered tool list into the prefix silently forfeit the cache on every call and pay full freight without noticing. Treat the prompt prefix the way you treat a hot code path: stable, reviewed, and deliberately ordered. This is also the cheapest lever organizationally: it requires no new infrastructure, no procurement, and no traffic re-routing. It is a prompt-engineering and client-configuration change, deployable per-agent, this sprint.
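One way to keep that discipline honest is to fingerprint the prefix on every call and alert when it changes mid-task. The sketch below is a local check under stated assumptions, not any provider's API: the helper names, the tool definitions, and the choice of SHA-256 are all hypothetical, and the provider's own cache matching remains the source of truth.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the stable material deterministically: tools sorted by name, keys sorted."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    return system_prompt + "\n" + json.dumps(ordered_tools, sort_keys=True, separators=(",", ":"))


def fingerprint(prefix: str) -> str:
    """Short hash of the exact bytes; any change, however small, changes the fingerprint."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


# Hypothetical agent configuration, for illustration only.
tools = [
    {"name": "search", "description": "Search the docs"},
    {"name": "read_file", "description": "Read a file"},
]
system_prompt = "You are a build agent."

stable = build_prefix(system_prompt, tools)
reordered = build_prefix(system_prompt, list(reversed(tools)))  # same tools, different input order
stamped = build_prefix(system_prompt + " Now: 2026-06-09T03:00:00Z", tools)  # timestamp in prefix

print(fingerprint(stable) == fingerprint(reordered))  # True: ordering is normalized away
print(fingerprint(stable) == fingerprint(stamped))    # False: the timestamp changes the prefix
```

Logging the fingerprint alongside each call makes prefix churn visible: a task whose fingerprint changes from turn to turn is a task that is probably paying full price for its prefix.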

Lever 2. LLM Gateways: Spend You Can See Is Spend You Can Govern

Caching cuts the bill. It does nothing to tell you who is running it up. An agentic estate of any size (multiple teams, multiple agents, multiple providers) produces AI traffic that is, by default, opaque: API keys shared across services, token consumption visible only as a monthly invoice, and no chokepoint where policy can be enforced before a call leaves the building. The second lever is putting a gateway in that path.

Two shipping products serve as concrete reference points for what the layer looks like in practice. Tailscale Aperture, in public beta, is the cost-governance example: it sits between your workloads and the model providers and applies quotas per provider, per model, and per identity, with usage monitoring and guardrail hooks that run before the LLM call is made. Google Cloud Model Armor covers the adjacent security face of the same chokepoint: screening prompts and responses in flight. Google's release notes list Model Armor's integration with the Gemini Enterprise Agent Platform as generally available (December 3, 2025), while the separate integration with that platform's Agent Gateway is marked Preview in Google's own documentation as of June 2026; the two scopes are distinct, and only the former is GA. The two products are not interchangeable (Aperture is the budget-and-quota story, Model Armor the guardrail story), but together they illustrate the category: an enforcement point between agents and models where traffic becomes loggable, attributable, and subject to policy. (Vendor capability claims here are drawn from the vendors' own announcement and product pages; weigh them accordingly.)

For a platform team, the gateway is the precondition for everything downstream of the caps, because it converts "the AI bill is up" from a finance-team lament into an engineering signal. Once every model call flows through a chokepoint that knows which identity, which agent, which model, how many tokens, you can do the things FinOps actually consists of: per-team chargeback, per-agent budgets, anomaly alerts when a loop starts consuming tokens at 3 a.m., and hard quotas that stop a runaway agent at the limit instead of at the invoice. The agentic failure mode that makes this urgent is specific: agents are loops, and loops can run away. A retry storm, a degenerate planning cycle, an agent that keeps re-reading a million-token context because nobody capped it: without a metering chokepoint, the first sign of any of these is the bill. With one, it's an alert, then a quota, then a Tuesday.
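The core of that chokepoint is small enough to sketch. What follows is a toy in-memory ledger, not a description of how Aperture or any other product works: the class, the team and agent names, and the quota figure are all hypothetical, and a real gateway would persist usage, reset it per billing period, and meter actual rather than estimated tokens.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Meter:
    """In-memory stand-in for a gateway's usage ledger, keyed by (team, agent)."""
    quotas: dict[str, int]  # hypothetical per-team token quotas
    used: dict[tuple[str, str], int] = field(default_factory=lambda: defaultdict(int))

    def team_total(self, team: str) -> int:
        """Tokens consumed so far by every agent belonging to one team."""
        return sum(tokens for (owner, _agent), tokens in self.used.items() if owner == team)

    def admit(self, team: str, agent: str, tokens: int) -> bool:
        """Record the call and return True if it fits the team's quota, else refuse it."""
        # Teams with no configured quota get 0, so unknown callers are denied by default.
        if self.team_total(team) + tokens > self.quotas.get(team, 0):
            return False
        self.used[(team, agent)] += tokens
        return True


meter = Meter(quotas={"payments": 1_000_000})
print(meter.admit("payments", "refactor-bot", 600_000))  # True
print(meter.admit("payments", "triage-bot", 300_000))    # True
print(meter.admit("payments", "refactor-bot", 200_000))  # False: would exceed the quota
print(dict(meter.used))                                  # per-agent attribution for chargeback
```

The design choice worth noticing is that attribution and enforcement share one record: the same ledger that refuses the third call is the one that tells you which agent consumed the first 600,000 tokens.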

The sequencing point matters: deploy the gateway before you negotiate routing policy or argue about model tiers, because every downstream decision depends on data only the gateway can produce. You cannot route what you cannot measure, and you cannot budget what you cannot attribute.

Lever 3. Tiered Routing: A Real Pattern, Not a Benchmarked Number

The third lever is routing: sending most work to cheaper or smaller models and reserving frontier models for the steps that genuinely need frontier capability. The logic is straightforward. An agent task is not a uniform stream of equally hard tokens; it mixes trivial steps (formatting output, extracting a field, summarizing a tool result) with genuinely hard ones (planning, debugging, judgment calls), and paying frontier prices for the trivial steps is pure waste. Once a gateway gives you per-step visibility, routing is the natural next move, and it is the pattern the gateway products themselves are built to support: per-model quotas and per-model policy are routing's enforcement half. Nor is the pattern merely folk practice; it ships as a documented product feature: Amazon Bedrock's Intelligent Prompt Routing routes each request between models in a family to the cheapest one predicted to handle it, with AWS quoting savings of "up to 30%", a vendor figure, cited here as evidence the practice exists, not as a number to budget against. The tier structure routing exploits is explicit on the vendors' own price sheets: Anthropic's current line, for instance, lists Haiku, Sonnet, and Opus at $1, $3, and $5 per million input tokens (with output at five times input on each tier, and the premium Fable 5 tier at double Opus), so the cost of sending a trivial step to a frontier tier is a published, knowable multiple rather than a guess.
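To make the "knowable multiple" concrete, here is a minimal sketch that prices one trivial step on each tier. It assumes the input prices quoted above with output at five times input, ignores caching and any premium tier, and uses hypothetical token counts; re-check the live price sheet before relying on any figure.

```python
# Input prices in USD per million tokens, as quoted above; output assumed at 5x input.
INPUT_PRICE = {"haiku": 1.0, "sonnet": 3.0, "opus": 5.0}
OUTPUT_MULTIPLIER = 5.0


def step_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one step on a tier, ignoring caching and any other discounts."""
    in_price = INPUT_PRICE[tier]
    out_price = in_price * OUTPUT_MULTIPLIER
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical trivial step: 20k tokens of context in, 200 tokens out.
baseline = step_cost("haiku", 20_000, 200)
for tier in INPUT_PRICE:
    cost = step_cost(tier, 20_000, 200)
    print(f"{tier}: ${cost:.4f} ({cost / baseline:.0f}x haiku)")
```

Because output is priced as a fixed multiple of input on every tier, the ratio between tiers is the same whatever the step's mix of input and output tokens. Whether the cheaper tier can actually do the step is a separate question that the price sheet cannot answer.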

Now the caution, stated as plainly as the lever itself, because this is where cost-optimization content tends to go soft. Specific routing ratios circulate in vendor decks and conference talks: "60/30/10" splits across model tiers, named tiering schemes with confident savings percentages attached. Those ratios are illustrative patterns, not benchmarked results. No public benchmark establishes a universal split, the right mix depends entirely on your task distribution, and any savings figure quoted without your workload behind it is a hypothesis wearing a percentage. This article carries exactly one set of verifiable cost numbers, and they are the caching multipliers in Lever 1. The routing lever's economics are real but local to you, and they have to be measured, not copied.

The practitioner version of the lever is therefore a measurement loop, not a configuration value. Classify the step types your agents actually execute (the gateway logs give you this). Route each class to the cheapest model that can plausibly handle it. Hold output quality constant with the verification you already run (tests, schema checks, evals) and watch two numbers: cost per completed task and failure/retry rate. A cheaper model that fails more can cost more end-to-end than the frontier model it replaced, because every retry re-runs the loop, re-bills the prefill, and re-consumes the very tokens you were trying to save. Routing is an architecture you adopt and then tune against your own task mix; treat any pre-baked ratio as a starting hypothesis at most.
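The two numbers can be computed directly from attempt-level logs. The sketch below assumes a hypothetical log schema (route label, task ID, billed cost, pass/fail) and invented figures chosen to show the failure mode; it is not drawn from any real workload.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    route: str       # hypothetical label for the routing policy under test
    task_id: str
    cost_usd: float  # total billed cost of this attempt, prefill included
    passed: bool     # did the output pass the checks you already run?


def summarize(attempts: list[Attempt], route: str) -> tuple[float, float]:
    """Return (cost per completed task, retry rate) for one route."""
    rows = [a for a in attempts if a.route == route]
    completed = {a.task_id for a in rows if a.passed}
    if not rows or not completed:
        raise ValueError(f"no completed tasks for route {route!r}")
    total_cost = sum(a.cost_usd for a in rows)
    # Every attempt beyond the first for a given task counts as a retry.
    retries = len(rows) - len({a.task_id for a in rows})
    return total_cost / len(completed), retries / len(rows)


# Invented data: the cheap route costs $0.04 per attempt but needs several tries.
log = (
    [Attempt("cheap", "t1", 0.04, passed) for passed in (False, False, False, True)]
    + [Attempt("cheap", "t2", 0.04, passed) for passed in (False, True)]
    + [Attempt("frontier", "t1", 0.10, True), Attempt("frontier", "t2", 0.10, True)]
)

for route in ("cheap", "frontier"):
    cost, retry_rate = summarize(log, route)
    print(f"{route}: ${cost:.2f} per completed task, retry rate {retry_rate:.0%}")
```

In this invented log the cheap route is 60 percent cheaper per attempt and still more expensive per completed task, which is the case the per-attempt price hides and the per-task metric exposes.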

Underneath Levers 1 Through 3: Prefill and the KV-Cache

The three optimization levers look unrelated (a pricing feature, a network chokepoint, an architecture pattern), but they are pulling on the same underlying physics, and knowing the mechanics helps you predict which lever pays where.

Transformer inference has two phases with very different cost profiles: prefill, where the model processes the entire input context before producing anything, and decode, where it generates output tokens one at a time. Processing a long context is compute-intensive, and the KV-cache, the model's working memory of that processed context, occupies GPU memory in competition with everything else the hardware is doing. For agentic workloads, where each turn ships a long and largely repeated context, the input side of the ledger is where the tokens concentrate.
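A rough token count shows how lopsided the ledger gets. The model below is a deliberate simplification with hypothetical numbers: a fixed prefix, a context that grows by a constant amount each turn, a constant output size, and the whole context re-sent on every turn with no caching.

```python
def loop_tokens(turns: int, prefix: int, growth_per_turn: int, output_per_turn: int) -> tuple[int, int]:
    """Total (input, output) tokens for a loop that re-sends its whole context every turn."""
    total_input = sum(prefix + turn * growth_per_turn for turn in range(turns))
    total_output = turns * output_per_turn
    return total_input, total_output


# Hypothetical task: 40 turns, 30k-token prefix, 1.5k tokens of new context and 400 of output per turn.
total_in, total_out = loop_tokens(turns=40, prefix=30_000, growth_per_turn=1_500, output_per_turn=400)
share = total_in / (total_in + total_out)
print(f"input tokens:  {total_in:,}")
print(f"output tokens: {total_out:,}")
print(f"input share of all tokens: {share:.1%}")
```

Under these assumptions more than 99 percent of the tokens processed are input. That is a share of token volume, not of cost, since output tokens are priced higher, but it indicates why the repeated input is the first place to look.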

The hardware market is treating this split as real enough to build silicon around. NVIDIA's Rubin CPX, as covered by The Register in September 2025, is a context-optimized GPU aimed specifically at the prefill phase, disaggregating compute-bound context processing from generation so that high-bandwidth-memory GPUs can be reserved for decode. You do not need to buy the silicon, or track where the part lands in NVIDIA's roadmap, to extract the practitioner insight: when the dominant inference vendor designs a dedicated chip for processing input context, that is a strong signal about where long-context costs live.

And it closes the loop on this playbook's priorities. If the expensive phase of an agentic turn is re-processing a long, mostly repeated input, then the software change that buys you the most is to stop re-processing it, which is exactly what prompt caching does, which is why Lever 1 comes first, and why its discount (0.1× for a cache read) is as steep as it is. The gateway tells you where the long-context turns are happening; routing decides which model's prefill you pay for; caching makes sure you pay for the repeated part once. Three levers, one cost structure.

What This Buys You (and What It Doesn't)

Everything above is deployable this quarter. None of it requires re-architecting your agents, renegotiating contracts, or waiting on a vendor roadmap: spend caps are settings in admin consoles you already control, caching is a prompt-structure and client change, a gateway is an infrastructure deployment with shipping products to choose from, and routing is policy on top of the gateway's data. The honest hierarchy of evidence runs the same direction as the recommended order of adoption: the caching economics are official published pricing; the gateway capabilities are vendor-documented, shipping features; routing is a vendor-documented, shipping pattern whose savings are unbenchmarked and have to be established against your own workload.

What the four levers buy you is spend that is capped (a hard ceiling that holds even before the telemetry exists), observable (every call logged and attributed), controllable (quotas, budgets, alerts, guardrails at a chokepoint), and reducible (caching and routing attacking the cost structure where it actually lives). That is the FinOps foundation, and it is worth having on its own terms: it converts an open-ended liability into a managed budget line.

There is also a longer arc worth naming, because it locates where this playbook sits. For AI-native products, inference spend lands in cost of goods sold, which makes runaway token consumption a gross-margin problem rather than a budgeting nuisance, and a vendor category is forming around exactly that gap. Frugal, a cost-engineering startup that announced a $5 million seed round in November 2025, positions itself against traditional FinOps tooling on the argument that visibility-layer cost management (showback, chargeback, rate negotiation) tells you who spent, while the remaining gap is reducing what the code itself consumes; in CEO Mike Weider's words, "Cloud and AI costs are quietly eroding software margins." That is a vendor's launch positioning, cited as a signal of where the discipline is heading, not as an endorsement. The reason it belongs here is the sequencing it implies: the four levers in this playbook are the platform-level layer a team can deploy now, and code-level consumption reduction (of which Lever 1's prefix discipline is the first concrete instance) is where the practice goes next.

What it explicitly does not buy you is an answer to whether the spend was worth it. A 40-percent-cheaper agent that ships nothing of value is a smaller waste, not a win; an expensive agent that reliably ships real outcomes may be the bargain of the budget. Cost control and value measurement are different problems, and conflating them is how organizations end up optimizing the denominator of an ROI fraction whose numerator they never measured. The measurement problem (why single-metric ROI misleads for agentic work, why perceived gains diverge from measured output, and how to build value attribution you can defend) is the subject of the companion paper. The sequencing this article argues for is the one a platform team can actually execute: get the cost layer observable and governed first, with the levers that exist today. Then measure what the spend actually did.

Companion paper (in progress): "Proving the Loop Paid Off: Measuring and Governing the Value of Agentic AI Spend."


Sources and AI assistance. This article was drafted with AI assistance and verified by the author; every load-bearing claim maps to a public source. Volatile figures (prices, plans, star counts) are dated as of June 2026 and should be re-checked before relying on them. Source pack: Anthropic prompt-caching documentation (official docs; multipliers verified June 2026) · GitHub, "GitHub Copilot is moving to usage-based billing" (official, Apr 27, 2026) · GitHub Changelog, "GitHub Copilot now supports OpenCode" (official, Jan 16, 2026) · Yahoo Finance syndicated reporting on Uber's AI-tool caps, citing The Information and Bloomberg (press, second-hand) · Visual Studio Magazine, "Copilot Billing Shock Hits Developers" (trade press, Jun 4, 2026) · OpenCode Zen and Go documentation (vendor self-description) · The Register on Anthropic's third-party-harness restriction (Feb 20, 2026) and on NVIDIA Rubin CPX (Sep 10, 2025) · Tailscale Aperture beta announcement (vendor, Apr 23, 2026) · Google Cloud Model Armor product page, Security Command Center release notes, and Model Armor–Agent Gateway integration documentation (vendor/official) · Amazon Bedrock Intelligent Prompt Routing documentation (vendor docs) · TechCrunch on Glean's ARR and positioning (press, May 28, 2026) · Anthropic model price sheet (official docs; tier rates verified June 2026) · Frugal seed-round announcement via Business Wire (vendor launch positioning, Nov 12, 2025) · InfoWorld and Finout for the pre-existing "FinOps for agents" framing (context, non-load-bearing). Routing ratios such as "60/30/10" are illustrative patterns, not benchmarks.

Released under the MIT License.