FinOps for Agents: A Practitioner Playbook for Routing, Gateways, and Prompt Caching

Published: June 2026 | Author: David Daniel

Teaser / companion to the forthcoming paper "Proving the Loop Paid Off: Measuring and Governing the Value of Agentic AI Spend." This article covers the near-term cost-control layer: the operational levers a platform team can pull this quarter. The fuller paper takes up the harder question it sets up: once you have spend under control, how do you measure whether it created value?

The Bill Became the Conversation

A vendor crossing $300 million in annual recurring revenue is not, by itself, news. A vendor crossing $300 million while making cutting your AI budget a major selling point is. That is where Glean is in mid-2026, according to TechCrunch's May 28 reporting. The company has leaned publicly into the pitch that customers are blowing through their AI budgets and that cutting that token spend is a service they will pay for.

The ARR figure and the budget-cutting pitch are the company's public positioning as reported by the press, not an audited financial analysis. Glean is a single data point. Still, when a fast-growing enterprise AI vendor decides the winning sales angle is we'll help you spend less on AI, that tells you something about what its customers are asking for.

There is a structural reason to expect this (framing, not a reported statistic). Chat-style AI assistance has a natural cost governor: a human can only type so fast. An autonomous agent has no such governor. A long-running agent loop re-reads its instructions, its tool definitions, its retrieved context, and its accumulated working state on every turn, for as many turns as the task takes. Nobody is watching the meter.

Spend stops scaling with the number of people you've licensed and starts scaling with how much autonomy you've granted. Seat-based budgeting was never designed to model a variable cost with that shape. It is why "the AI line item is up again" lands on platform teams as an operational emergency rather than a procurement question.

The instinctive response is to reach for the hard question first: is this spend worth it? That question covers value attribution for agentic work, why single-metric ROI misleads, and how to build measurement you can trust. It is genuinely hard, genuinely contested, and the subject of the companion paper. But there is a more immediate move available, and it does not require resolving the ROI debate first: make the spend observable and controllable. That is FinOps applied to agentic workloads.

The term is borrowed deliberately. FinOps is the established discipline of operationalizing cloud cost management, and extending it to agents is a framing others in the industry are converging on too. InfoWorld and FinOps-tooling vendors such as Finout have both published under the same banner. This article doesn't claim the coinage, only the playbook.

The argument here is narrow and practical: the near-term defense against runaway agent spend is operational, not architectural. You do not need to re-platform. You need four levers, all of them public, documented, and shipping today. Start with the bluntest one.

Lever 0. Spend Caps: Turn On the Brakes Before You Tune the Engine

Before the cleverer levers, the blunt one: hard budget caps, set at the platform level, switched on first. This is lever zero for two reasons. It is the only control that works before you have telemetry, because a cap does not need to know which agent burned the tokens in order to stop the burn. And the platforms are now shipping it natively.

GitHub's April 27, 2026 announcement moving Copilot to usage-based billing ships the controls in the same release as the meter: "Admins will also have new budget controls. They will be able to set budgets at the enterprise, cost center, and user levels. When the included pool is exhausted, organizations can choose whether to allow additional usage at published rates or cap spend." That is a vendor describing its own feature, but the practitioner move does not depend on taking the framing at face value. If a platform you already pay for exposes a budget object, create it. Set it below the number that would trigger an emergency meeting. Only then start optimizing.

The case for doing this now rather than next quarter: uncapped meters bite fast, at every company size. At the large end, Uber reportedly burned through its entire planned 2026 budget for agentic AI coding tools (Claude Code and Cursor among them) within the first four months of the year. It responded with a hard cap of $1,500 per employee per month per tool. The budget exhaustion is what Uber's CTO, Praveen Neppalli Naga, told The Information in April. The $1,500 cap comes from Bloomberg's reporting, and both reach here via Yahoo Finance's syndicated write-up. Every figure here is second-hand press, reported rather than audited.

At the individual-developer end, the first week of GitHub's metered billing produced exactly the failure mode caps exist for. Visual Studio Magazine's June 4 piece "Copilot Billing Shock Hits Developers" documents users exhausting most of a month's credit allowance within days of token billing taking effect June 1. Both stories show what a metered AI platform does by default with no cap in place: spend runs until the invoice arrives (one case reported, one documented in trade press). That is a case for caps, and no knock on usage-based pricing.

One adjacent lever is emerging in the same governance conversation, and it carries a named risk: subscription arbitrage through open harnesses. OpenCode is an MIT-licensed agent harness with about 169,000 GitHub stars as of June 9, 2026 (a volatile count, cited for scale only). Under an official partnership, paid Copilot subscribers can drive it with the subscription they already hold. GitHub's own changelog (January 16, 2026) puts it as "no additional AI license needed." The same project sells gateway model access "at cost" and a $10/month plan on open models ($5 for the first month). Both characterizations are from OpenCode's own pages as of June 2026, and "cheaper" remains community framing, not an independently verified saving.

This belongs under governance rather than routing because of the failure mode: the lever can be revoked from the model vendor's side. Anthropic moved in early 2026 to block Claude Pro/Max subscriptions from third-party harnesses, OpenCode included, as The Register reported on February 20, 2026. Treat arbitrage as an opportunistic discount that can be revoked out from under you.

Lever 1. Prompt Caching: The Most Verifiable Cut Available

With the caps on, start the optimization work with prompt caching. It is the one lever whose economics rest on an official published number rather than a vendor case study or a modeled estimate. Agentic workloads have a distinctive traffic shape: they re-send enormous, near-identical context (system prompt, tool definitions, schemas, file trees, retrieved documents) turn after turn after turn. Caching exists precisely to stop paying full price for that repetition.

Anthropic's prompt-caching documentation puts exact numbers on the pricing. Cache reads are billed at 0.1× the base input-token price, roughly ten times cheaper than sending the same tokens fresh. Writing to the cache carries a premium: 1.25× base for a 5-minute cache lifetime, 2× for a 1-hour lifetime. Those multipliers are current in the live docs as of June 2026. Check the docs for model-specific rates before budgeting against them.

The break-even arithmetic falls straight out of those multipliers (derived from the published pricing, not a separately sourced claim). Compare caching a stable prefix against re-sending it uncached on every turn. With a 5-minute cache, the first turn costs 1.25× instead of 1×, and every subsequent turn costs 0.1× instead of 1×. After a single cache hit the cached path is already cheaper: 1.35× total versus 2× uncached. Every turn after that widens the gap by 0.9× of the prefix cost. A 1-hour cache write, at 2×, breaks even on the second hit.

For any prefix that survives even two turns, the write premium is noise. A multi-turn agent re-reads its prefix dozens or hundreds of times per task. That is why this is the one lever whose payoff you can compute on the back of an envelope before deploying anything. The real-world numbers will differ from the idealized arithmetic (caches expire, prefixes churn, not every token is cacheable), but the direction and rough magnitude are fixed by the published multipliers.

The practitioner discipline that makes the lever pay is prefix stability. Caching rewards context that is identical, byte for byte, across calls. Structure agent prompts so the stable material (instructions, tool definitions, schemas, long reference documents) sits at the front and never varies mid-task, and append the volatile material (the current turn, fresh tool output) after it. Teams that interleave a timestamp, a dynamic ID, or a re-ordered tool list into the prefix silently forfeit the cache on every call and pay full freight without noticing. Treat the prompt prefix the way you treat a hot code path: stable, reviewed, and deliberately ordered.

This is also the cheapest lever organizationally. It requires no new infrastructure, no procurement, and no traffic re-routing. It is a prompt-engineering and client-configuration change, deployable per-agent, this sprint.

Lever 2. LLM Gateways: Spend You Can See Is Spend You Can Govern

Caching cuts the bill. It does nothing to tell you who is running it up. An agentic estate of any size (multiple teams, multiple agents, multiple providers) produces AI traffic that is opaque by default: API keys shared across services, token consumption visible only as a monthly invoice, and no chokepoint where policy can be enforced before a call leaves the building. The second lever is putting a gateway in that path.

Two shipping products show what the layer looks like in practice. Tailscale Aperture, in public beta, is the cost-governance example. It sits between your workloads and the model providers and applies quotas per provider, per model, and per identity, with usage monitoring and guardrail hooks that run before the LLM call is made. Google Cloud Model Armor covers the adjacent security face of the same chokepoint: screening prompts and responses in flight.

Google's release notes list Model Armor's integration with the Gemini Enterprise Agent Platform as generally available (December 3, 2025). The separate integration with that platform's Agent Gateway is marked Preview in Google's own documentation as of June 2026. The two scopes are distinct, and only the former is GA.

The two products are not interchangeable: Aperture is the budget-and-quota story, Model Armor the guardrail story. Together they illustrate the category: an enforcement point between agents and models where traffic becomes loggable, attributable, and subject to policy. (Capability claims are from the vendors' own announcement and product pages.)

For a platform team, the gateway is the precondition for everything downstream of the caps. It converts "the AI bill is up" from a finance-team lament into an engineering signal. Once every model call flows through a chokepoint, you know which identity, which agent, which model, how many tokens. That enables the things FinOps actually consists of: per-team chargeback, per-agent budgets, anomaly alerts when a loop starts consuming tokens at 3 a.m., and hard quotas that stop a runaway agent at the limit instead of at the invoice.

The agentic failure mode that makes this urgent is specific: agents are loops, and loops can run away. A retry storm, a degenerate planning cycle, an agent that keeps re-reading a million-token context because nobody capped it. Without a metering chokepoint, the first sign of any of these is the bill. With one, it's an alert, then a quota, then a Tuesday.

Sequencing matters here too: deploy the gateway before you negotiate routing policy or argue about model tiers. Every downstream decision depends on data only the gateway can produce. Routing needs measurement, and budgets need attribution.

Lever 3. Tiered Routing: A Real Pattern, Not a Benchmarked Number

The third lever is routing: send most work to cheaper or smaller models, and reserve frontier models for the steps that genuinely need frontier capability. The logic is straightforward. An agent task is not a uniform stream of equally hard tokens. It mixes trivial steps (formatting output, extracting a field, summarizing a tool result) with genuinely hard ones (planning, debugging, judgment calls), and paying frontier prices for the trivial steps is pure waste.

Once a gateway gives you per-step visibility, routing is the natural next move. It is the pattern the gateway products themselves are built to support: per-model quotas and per-model policy are routing's enforcement half. It also ships as a documented product feature, not just folk practice. Amazon Bedrock's Intelligent Prompt Routing routes each request between models in a family to the cheapest one predicted to handle it, with AWS quoting savings of "up to 30%" (a vendor figure; verify it against your own workload).

The tier structure routing exploits is explicit on the vendors' own price sheets. Anthropic's current line, for instance, lists Haiku, Sonnet, and Opus at $1, $3, and $5 per million input tokens, with output at five times input on each tier and the premium Fable 5 tier at double Opus. The cost of sending a trivial step to a frontier tier is a published, knowable multiple rather than a guess.

This is where cost-optimization content tends to go soft. Specific routing ratios circulate in vendor decks and conference talks: "60/30/10" splits across model tiers, named tiering schemes with confident savings percentages attached. Those ratios are illustrative patterns, not benchmarked results.

No public benchmark establishes a universal split. The right mix depends entirely on your task distribution, and any savings figure quoted without your workload behind it is a hypothesis you still have to test. The only verifiable cost numbers in this playbook are the caching multipliers in Lever 1. The routing lever's economics are real but local to you. They have to be measured, not copied.

The practitioner version of the lever is therefore a measurement loop. Classify the step types your agents actually execute (the gateway logs give you this). Route the cheapest-plausible model at each class. Hold output quality constant with the verification you already run (tests, schema checks, evals) and watch two numbers: cost per completed task and failure/retry rate.

A cheaper model that fails more can cost more end-to-end than the frontier model it replaced, because every retry re-runs the loop, re-bills the prefill, and re-consumes the very tokens you were trying to save. Routing is an architecture you adopt and then tune against your own task mix. Treat any pre-baked ratio as a starting hypothesis at most.

The second caution is about who should build the routing layer at all. The model vendors now ship routing inside their own products, and they iterate on it at a pace an internal policy cannot track. Claude Code's opusplan setting is tiered routing as a single configuration word: it runs Opus while planning and switches to Sonnet for execution (Claude Code model configuration, official docs).

Claude Fable 5, released June 9, 2026, goes further. When its safety classifiers flag a request, it re-runs automatically on Opus 4.8 with a notice in the transcript. Anthropic bills the rerouted input tokens at cache-read rates, 10% of the base input price, as documented in June 2026 (Fable 5 fallback and billing guide, official docs). That discount lives inside Anthropic's billing, so a third-party gateway cannot replicate it.

It is also a moving target. Fable 5 shipped with a separate safeguard that silently degraded responses for users it suspected of frontier AI development work. Two days after launch, under public criticism, Anthropic reversed it: flagged requests now fall back visibly to Opus 4.8, and the company called the silent version "the wrong tradeoff" (Decrypt, June 11, 2026, press). The visible behavior now appears in the model configuration docs cited above. A routing policy an internal team had written in May would have been stale twice by mid-June: once when Fable 5 shipped a new top tier, and again when its fallback behavior changed.

The asymmetry behind both examples is structural. A model vendor's margin depends on delivering the most output per unit of compute (an incentive reading, not a vendor claim), so routing, caching defaults, and adaptive reasoning (the model deciding per step how much to think) are core product surface that vendors work on continuously. For a platform team they are a side project, maintained between other priorities.

So unless you are an AI-native tooling company whose product is the routing layer, the working default is to consume the routing the vendors ship: harness model aliases, plan/execute splits, Bedrock's Intelligent Prompt Routing. The gateway keeps the jobs only it can do: visibility, attribution, and enforcement.

The OpenCode case above is the same lesson from the access side. A lever that lives on the vendor's side of the line moves on the vendor's schedule, in features, in prices, and in whether it stays available to you at all.

One framing rule keeps this lever from becoming a liability: route against cost per completed task at constant quality, never against tokens consumed. The two metrics look interchangeable and behave oppositely. Cost per completed task is closed-loop: if a downgraded model degrades output, the verification you already run catches it, retries fire, and the metric itself punishes the downgrade. Raw token consumption is open-loop: it improves when you skip verification passes and trim the context that makes agents reliable, which is to say it improves fastest when the work is quietly getting worse.

An organization that adopts this lever to satisfy a token-reduction mandate has taken the architecture and discarded its control system. The companion measurement work takes this failure mode up directly.

Underneath Levers 1 Through 3: Prefill and the KV-Cache

The three optimization levers look unrelated: a pricing feature, a network chokepoint, an architecture pattern. But they are pulling on the same underlying physics, and knowing the mechanics helps you predict which lever pays where.

Transformer inference has two phases with very different cost profiles. In prefill, the model processes the entire input context before producing anything. In decode, it generates output tokens one at a time. Processing a long context is compute-intensive, and the KV-cache, the model's working memory of that processed context, occupies GPU memory in competition with everything else the hardware is doing. For agentic workloads, where each turn ships a long and largely repeated context, the input side of the ledger is where the tokens concentrate.

The hardware market is treating this split as real enough to build silicon around. NVIDIA's Rubin CPX, as covered by The Register in September 2025, is a context-optimized GPU aimed specifically at the prefill phase. It disaggregates compute-bound context processing from generation so that high-bandwidth-memory GPUs can be reserved for decode. You do not need to buy the silicon, or track where the part lands in NVIDIA's roadmap, to extract the practitioner insight. When the dominant inference vendor designs a dedicated chip for processing input context, that is a strong signal about where long-context costs live.

And it closes the loop on this playbook's priorities. If the expensive phase of an agentic turn is re-processing a long, mostly repeated input, the software change that buys you the most is to stop re-processing it. That is exactly what prompt caching does. It is why Lever 1 comes first, and why its 0.1× cache-read discount is as steep as it is. The gateway tells you where the long-context turns are happening. Routing decides which model's prefill you pay for. Caching makes sure you pay for the repeated part once. Three levers, one cost structure.

What This Buys You (and What It Doesn't)

Everything above is deployable this quarter. None of it requires re-architecting your agents, renegotiating contracts, or waiting on a vendor roadmap. Spend caps are settings in admin consoles you already control. Caching is a prompt-structure and client change. A gateway is an infrastructure deployment with shipping products to choose from. Routing is policy on top of the gateway's data, and most of it should be vendor routing you consume rather than a layer you build.

The hierarchy of evidence runs the same direction as the recommended order of adoption. The caching economics are official published pricing. The gateway capabilities are vendor-documented, shipping features. Routing is a vendor-documented, shipping pattern whose savings are unbenchmarked and have to be established against your own workload.

The four levers buy you a hard ceiling that holds before the telemetry exists, every call logged and attributed, quotas and alerts enforced at a chokepoint, and caching and routing attacking the cost structure where it actually lives. That is the FinOps foundation, and it is worth having on its own terms: it converts an open-ended liability into a managed budget line.

For AI-native products, inference spend lands in cost of goods sold. That makes runaway token consumption a gross-margin problem rather than a budgeting nuisance, and at least one startup is now selling against exactly that gap. Frugal, a cost-engineering startup that announced a $5 million seed round in November 2025, positions itself against traditional FinOps tooling. Its argument: visibility-layer cost management (showback, chargeback, rate negotiation) tells you who spent, and the remaining gap is reducing what the code itself consumes. In CEO Mike Weider's words, "Cloud and AI costs are quietly eroding software margins." That is a vendor's launch positioning, a signal of where the discipline is heading, not an endorsement.

The reason it belongs here is the sequencing it implies. The four levers in this playbook are the platform-level layer a team can deploy now. Code-level consumption reduction is where the practice goes next, and Lever 1's prefix discipline is its first instance.

What it explicitly does not buy you is an answer to whether the spend was worth it. A 40-percent-cheaper agent that ships nothing of value is a smaller waste, not a win. An expensive agent that reliably ships real outcomes may be the bargain of the budget. Cost control and value measurement are different problems, and conflating them is how organizations end up optimizing the denominator of an ROI fraction whose numerator they never measured. Lever 3 is where that mistake gets made first, which is why its metric rule (cost per completed task, never raw tokens) is the one sentence in this playbook that must survive any summary of it.

The measurement problem is the subject of the companion paper: why single-metric ROI misleads for agentic work, why perceived gains diverge from measured output, and how to build value attribution you can defend. One framing note for that work: "what did the spend create" prices an input. The better question is what work was delivered and how fast, against what the same delivery used to cost and take. The sequencing argued for here is the one a platform team can actually execute. Get the cost layer observable and governed first, with the levers that exist today. Then measure what the work delivered.

Companion paper (in progress): "Proving the Loop Paid Off: Measuring and Governing the Value of Agentic AI Spend."

Sources and AI assistance. This article was drafted with AI assistance and verified by the author; every claim the argument depends on maps to a public source. Volatile figures (prices, plans, star counts) are dated as of June 2026 and should be re-checked before relying on them. Source pack: Anthropic prompt-caching documentation (official docs; multipliers verified June 2026) · GitHub, "GitHub Copilot is moving to usage-based billing" (official, Apr 27, 2026) · GitHub Changelog, "GitHub Copilot now supports OpenCode" (official, Jan 16, 2026) · Yahoo Finance syndicated reporting on Uber's AI-tool caps, citing The Information and Bloomberg (press, second-hand) · Visual Studio Magazine, "Copilot Billing Shock Hits Developers" (trade press, Jun 4, 2026) · OpenCode Zen and Go documentation (vendor self-description) · The Register on Anthropic's third-party-harness restriction (Feb 20, 2026) and on NVIDIA Rubin CPX (Sep 10, 2025) · Tailscale Aperture beta announcement (vendor, Apr 23, 2026) · Google Cloud Model Armor product page, Security Command Center release notes, and Model Armor–Agent Gateway integration documentation (vendor/official) · Amazon Bedrock Intelligent Prompt Routing documentation (vendor docs) · Claude Code model configuration documentation (official docs: opusplan and automatic model fallback, verified June 2026) · Claude cookbook, "Classifier fallback and billing for Claude Fable 5" (official docs: cache-read billing on fallback) · Decrypt on the Fable 5 silent-safeguard reversal, carrying Anthropic's statement (press, Jun 11, 2026) · TechCrunch on Glean's ARR and positioning (press, May 28, 2026) · Anthropic model price sheet (official docs; tier rates verified June 2026) · Frugal seed-round announcement via Business Wire (vendor launch positioning, Nov 12, 2025) · InfoWorld and Finout for the pre-existing "FinOps for agents" framing (context only; no claim depends on them). Routing ratios such as "60/30/10" are illustrative patterns, not benchmarks.

FinOps for Agents: A Practitioner Playbook for Routing, Gateways, and Prompt Caching ​

The Bill Became the Conversation ​

Lever 0. Spend Caps: Turn On the Brakes Before You Tune the Engine ​

Lever 1. Prompt Caching: The Most Verifiable Cut Available ​

Lever 2. LLM Gateways: Spend You Can See Is Spend You Can Govern ​

Lever 3. Tiered Routing: A Real Pattern, Not a Benchmarked Number ​

Underneath Levers 1 Through 3: Prefill and the KV-Cache ​

What This Buys You (and What It Doesn't) ​