o10 · Last updated 2026-06-09

State of Inference Spend 2026

Enterprise teams overpay for inference when traffic stays on default model routes. This report quantifies compliant savings from eval-gated routing across gateways, aggregators, and committed cloud capacity.

Research · June 2026 · Venue survey and workload models

Spread observed

638×

Routing modes

shadow → enforce

Framework

KYI

"Cheaper tokens miss the point. Up to 90% of an AI system's operational life is inference — where value, reliability, and risk are decided."

— Shen Pandi, Know Your Inference

Dashboards observe.
o10 enforces.

Cost dashboards tell you what you spent. o10 sits in the request path and changes what you spend — shadow first, then enforce.

Summary · Key takeaways

What you need to know

Short, self-contained answers with cited stats — read the sections below for full context.

What is the State of Inference Spend 2026?

Original o10 research quantifying compliant price spread across inference venues, workload savings models, and enterprise routing benchmarks — June 2026.

Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor.

Why do enterprises overpay for inference?

Traffic stays on default model routes across fragmented gateways. Finance sees blended invoices; platform teams lack a control point to enforce envelopes when prompts or retries change.

Teams without routing in the path leave an estimated 40–70% of compliant savings uncaptured.

How was the 638× spread measured?

o10 replayed representative enterprise workloads against candidate models on Vercel AI Gateway, OpenRouter, Amazon Bedrock committed capacity, and owned open-weight — at identical eval floors per use case.

Benchmarks use per-use-case quality floors, not global averages.

01 · Deep dive

Model price benchmark ($/1M tokens)

June 2026 venue survey across gateway, aggregator, and committed capacity pricing.

Open-weight 8B-class models clear many batch workloads at $0.05/1M on committed capacity versus $0.12 on gateways.

GPT-mini-class pricing shows a 20–30% spread between gateway and committed routes ($2.40 versus $1.85 per 1M tokens), which is material at billions of tokens per month.

Frontier tiers remain expensive; most enterprise use cases clear below sonnet-class when evals are workload-specific.

Model price benchmark ($/1M tokens) — June 2026

Tier	Gateway	Aggregator	Committed
Open-weight 8B	$0.12	$0.08	$0.05
Haiku-class	$0.65	$0.48	$0.42
GPT-mini-class	$2.40	$2.10	$1.85
Sonnet-class	$9.40	$8.10	$7.20
Frontier	$31.90	$28.00	$24.50
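
As a rough check on the table above, the sketch below computes the ratio between the most and least expensive cells and the committed-versus-gateway discount per tier. The prices are copied from the table; the function names are illustrative. Dividing the frontier gateway price by the open-weight committed price reproduces 638, though the report does not state that the headline spread was derived this way, so treat that reading as an assumption.

```python
# Prices in $/1M tokens, copied from the June 2026 benchmark table.
PRICES = {
    "Open-weight 8B": {"Gateway": 0.12, "Aggregator": 0.08, "Committed": 0.05},
    "Haiku-class": {"Gateway": 0.65, "Aggregator": 0.48, "Committed": 0.42},
    "GPT-mini-class": {"Gateway": 2.40, "Aggregator": 2.10, "Committed": 1.85},
    "Sonnet-class": {"Gateway": 9.40, "Aggregator": 8.10, "Committed": 7.20},
    "Frontier": {"Gateway": 31.90, "Aggregator": 28.00, "Committed": 24.50},
}


def overall_spread(prices):
    """Ratio of the most expensive to the least expensive cell in the table."""
    flat = [p for venues in prices.values() for p in venues.values()]
    return max(flat) / min(flat)


def venue_discount(prices, tier):
    """Fractional saving of the committed price relative to the gateway price."""
    row = prices[tier]
    return (row["Gateway"] - row["Committed"]) / row["Gateway"]


print(f"Overall spread: {overall_spread(PRICES):.0f}x")
for tier in PRICES:
    print(f"{tier}: committed is {venue_discount(PRICES, tier):.1%} below gateway")
```

The per-tier discount is measured against the gateway price; measuring against the committed price instead gives a larger percentage for the same two numbers.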

02 · Deep dive

Workload savings models

Savings depend on use case, quality floor, and venue mix — not a single percentage.

RAG summarization at balanced floor: up to 80% versus default sonnet-class routing.

Support assistants at strict QA floor: 40–60% with mini-class compliant routes.

Batch classification at lean floor: up to 94% routing to open-weight when evals permit.

Segment by use case before modeling
Shadow mode verifies against your baseline
Committed Bedrock drawdown changes venue economics
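
A savings model for one use case reduces to simple arithmetic once the baseline route and the compliant route are fixed. The sketch below is a minimal illustration, not o10's model: the monthly volume is hypothetical, and pairing a sonnet-class gateway default ($9.40) with a GPT-mini-class committed route ($1.85) is an assumption chosen because both prices appear in the benchmark table.

```python
def monthly_savings(tokens_millions, baseline_price, routed_price):
    """Return (dollars saved, fraction saved) for one use case.

    tokens_millions: monthly volume in millions of tokens (hypothetical input)
    baseline_price:  $/1M tokens on the default route
    routed_price:    $/1M tokens on the cheaper compliant route
    """
    baseline_cost = tokens_millions * baseline_price
    routed_cost = tokens_millions * routed_price
    saved = baseline_cost - routed_cost
    return saved, saved / baseline_cost


# Hypothetical RAG workload: 2,000M tokens per month.
saved, fraction = monthly_savings(2000, baseline_price=9.40, routed_price=1.85)
print(f"Saved ${saved:,.0f} per month ({fraction:.1%} of baseline)")
```

The percentage depends only on the two prices, while the dollar figure scales with volume, which is why segmenting by use case comes before modeling.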

03 · Deep dive

Implications for CFOs and platform teams

Inference spend is controllable when routing sits in the path with eval-gated selection.

Dashboards report last month's tokens. A control plane changes next month's routes — with proof in shadow first.

KYI adds board-grade governance above raw savings: performance, economics, integration, strategy, and risk.

How-to · Operational steps

Applying the 2026 benchmarks

01
Segment traffic by use case
Support, RAG, code, batch: each has its own volume and quality floor.
02
Run eval suites per workload
Define the cheapest compliant tier rather than defaulting to the frontier model.
03
Shadow for 7–14 days
Build a verified savings baseline per use case.
04
Enforce with envelopes
Hold budget and policy on every subsequent call.
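
Step 02 above can be sketched as a selection over candidates that have already been scored by a workload's eval suite. This is a minimal illustration under stated assumptions: the `Candidate` type, the eval scores, and the floor values are hypothetical, and only the prices come from the benchmark table. It is not o10's routing logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    tier: str
    venue: str
    price_per_million: float  # $/1M tokens, from the benchmark table
    eval_score: float         # hypothetical score from this workload's eval suite


def cheapest_compliant(candidates: Sequence[Candidate], floor: float) -> Optional[Candidate]:
    """Return the lowest-priced candidate whose eval score meets the floor."""
    passing = [c for c in candidates if c.eval_score >= floor]
    # default=None covers the case where no candidate clears the floor.
    return min(passing, key=lambda c: c.price_per_million, default=None)


# Hypothetical eval results for a batch classification workload.
batch_candidates = [
    Candidate("Open-weight 8B", "Committed", 0.05, eval_score=0.91),
    Candidate("Haiku-class", "Committed", 0.42, eval_score=0.94),
    Candidate("Frontier", "Gateway", 31.90, eval_score=0.99),
]

for floor in (0.90, 0.93, 0.995):
    choice = cheapest_compliant(batch_candidates, floor)
    label = f"{choice.tier} on {choice.venue}" if choice else "no compliant route"
    print(f"floor {floor}: {label}")
```

Raising the floor moves the selection to a more expensive tier, and a floor that nothing clears returns no route rather than silently falling back to the default.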

Source · Methodology

o10 State of Inference Spend 2026. Venue survey June 2026. Workload models use per-use-case eval floors. Shen Pandi, KYI framework.

FAQ · Frequently asked questions

Common questions

What is the inference price spread in 2026?

o10 measured up to 638× between the most and least expensive compliant routes for identical enterprise workloads at the same per-use-case quality floor across venues in June 2026. The spread is not uniform — it varies by workload, eval floor, and venue mix — but it demonstrates that default model routes routinely overshoot cheapest compliant supply. Shadow mode proves your organization's spread against your traffic; this report documents the benchmark methodology and venue price tables behind the headline number.

What percentage of savings do teams leave uncaptured?

Teams without routing in the request path leave an estimated 40–70% of compliant savings uncaptured, according to o10 benchmark methodology in the State of Inference Spend 2026 report. The gap exists because traffic stays on default models across fragmented gateways while cheaper tiers would clear the same eval floor. Dashboards reveal the overspend after invoices arrive; a control plane captures it on the next call. Segmenting by use case is essential — RAG and batch often show the largest absolute delta.

Which venues were benchmarked?

The June 2026 venue survey covers Vercel AI Gateway (per-token API), OpenRouter (multi-provider aggregator), Amazon Bedrock (per-token and committed capacity), and owned or open-weight infrastructure. Pricing tables in the report show five model tiers from open-weight 8B ($0.05/1M on committed) through frontier ($31.90/1M on gateway). Production stacks commonly combine multiple venues; o10 routes above all of them under one policy and ledger.

How often should benchmarks be refreshed?

Refresh inference price benchmarks at least quarterly — provider list prices, committed capacity economics, and model tiers shift faster than enterprise procurement cycles. o10 updates hub and glossary stats when venue surveys change; the visible last-updated date and Article schema dateModified should match. Stale benchmarks mislead forecasts; continuous ledger data from shadow and enforce mode supersedes static tables for your organization.

What is a quality floor in benchmarks?

A quality floor is the minimum eval score a model must achieve for a specific use case before o10 routes production traffic to it. Floors are per workload — support, RAG, code, and batch clear at different bars — and measured by replaying representative traffic through eval suites, not assumed from vendor benchmarks. Once a cheaper candidate passes the floor, o10 can route to it in shadow (proof) or enforce (live). Floors without evals are hopes; evals without floors are expensive defaults. Benchmark comparisons use identical floors across venues — otherwise price spread measurements compare unlike routes and overstate savings.

Does committed Bedrock change the math?

Yes. Routing compliant workloads through Amazon Bedrock committed capacity draws down sunk cloud commitments and lowers marginal $/1M versus pure per-token API rates. Many enterprises underutilize signed Bedrock spend while live traffic bills marginal gateway rates. o10 models capex/opex crossover per use case and routes through committed venues when evals permit — turning reserved capacity into inference value.
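
One way to reason about the committed-capacity math is through utilization. The sketch below assumes a simple take-or-pay commitment priced at the table's committed rate, which is an assumption for illustration and not a description of Amazon Bedrock contract terms or of o10's crossover model. Under that assumption, a commitment beats gateway pricing only once utilization exceeds the ratio of the committed price to the gateway price.

```python
def breakeven_utilization(committed_price, gateway_price):
    """Fraction of committed capacity that must be used to match gateway cost."""
    return committed_price / gateway_price


def effective_rate(commit_dollars, committed_price, used_millions):
    """Effective $/1M tokens actually paid, given how much capacity was used.

    Assumes the full commitment is paid regardless of usage (take-or-pay).
    """
    capacity_millions = commit_dollars / committed_price
    used = min(used_millions, capacity_millions)
    if used <= 0:
        raise ValueError("used_millions must be positive")
    return commit_dollars / used


# Sonnet-class prices from the benchmark table; the commitment size is hypothetical.
print(f"Break-even utilization: {breakeven_utilization(7.20, 9.40):.1%}")
print(f"Effective rate at half utilization: ${effective_rate(7200, 7.20, 500):.2f}/1M")
print(f"Effective rate at full utilization: ${effective_rate(7200, 7.20, 1000):.2f}/1M")
```

At half utilization the effective rate in this example is above the gateway price, which illustrates why routing compliant traffic into an underused commitment changes venue economics.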

What workloads save the most?

High-volume RAG summarization and batch classification typically show the largest absolute monthly savings because token volume multiplies small per-million price deltas. Support assistants save materially at strict QA floors when mini-class models clear evals. Code and agents benefit from per-step routing, which prevents frontier defaults on every hop. Savings range from 40% to 94% depending on floor and venue mix; shadow mode verifies the figure for your workloads.

How does shadow mode relate to this report?

Shadow mode mirrors live inference traffic through o10 without changing production routes. For every request, o10 evaluates candidate models against your per-use-case quality floors and records which route would have been cheapest and compliant — along with the cost delta — while the original provider still serves the response. Engineering sees proof without production risk; finance gets a verified savings figure tied to your traffic, not industry averages. Most teams run shadow for 7–14 days segmented by use case (support, RAG, code, batch) before flipping enforce mode. The report provides industry benchmark context; shadow provides your organization's proof against that methodology.
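
The bookkeeping described above can be sketched as a small ledger that compares what was served with the cheapest candidate that passed the floor, per use case. The request shape and field names are hypothetical, the served route is assumed to already meet the floor, and nothing here reflects o10's actual data model.

```python
from collections import defaultdict


def shadow_ledger(requests):
    """Aggregate served cost versus cheapest compliant cost per use case.

    Each request is a dict with hypothetical keys:
      use_case, tokens, served_price ($/1M), candidates [(price, passes_floor)].
    """
    ledger = defaultdict(lambda: {"served": 0.0, "shadow": 0.0})
    for req in requests:
        millions = req["tokens"] / 1_000_000
        passing = [price for price, ok in req["candidates"] if ok]
        # Include the served price so the shadow cost never exceeds actual cost.
        best_price = min(passing + [req["served_price"]])
        entry = ledger[req["use_case"]]
        entry["served"] += millions * req["served_price"]
        entry["shadow"] += millions * best_price
    return {
        use_case: {**entry, "delta": entry["served"] - entry["shadow"]}
        for use_case, entry in ledger.items()
    }


requests = [
    {"use_case": "support", "tokens": 1_000_000, "served_price": 9.40,
     "candidates": [(1.85, True), (0.05, False)]},  # the cheapest option fails the floor
    {"use_case": "batch", "tokens": 2_000_000, "served_price": 2.40,
     "candidates": [(0.05, True)]},
]

report = shadow_ledger(requests)
for use_case, row in report.items():
    print(use_case, {k: round(v, 2) for k, v in row.items()})
```

Production responses are untouched in this model; the ledger only records the counterfactual cost, so the delta is a claim about what enforce mode could save, segmented by use case.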

Who authored the research?

The State of Inference Spend 2026 report is published by o10 with benchmark methodology and venue survey data from June 2026. Framework context draws on Know Your Inference (KYI) by Shen Pandi, which governs how boards evaluate inference supply chains beyond per-token cost. Cite the report URL and methodology section when reproducing spread or pricing statistics.

Where is the full whitepaper?

The KYI framework whitepaper lives at o10.io/research/kyi-whitepaper with full HTML text, expanded FAQs, pillar scoring methodology, and board reporting guidance. It is the definitional source for Know Your Inference — designed for Wikipedia-style corroboration and AI answer engine citation alongside this spend report.

o10 · Set the envelope. o10 holds it.

See what you're overpaying.

Paste a week of traffic. Get the number that books the audit.


verified savings methodology · State of Inference Spend 2026