Same token price, different bill

A team migrates a workload onto a reasoning model, sees a familiar per-token price, and finds a new dial: reasoning effort. Turning it up feels almost free — so why not buy better answers everywhere? Then the bill climbs, even though the visible output barely changed. The reason is hidden "thinking" tokens: a reasoning model generates and bills them, but never shows them in the response. This project measures, on a small set of representative tasks, when those extra tokens earn their cost — and when they only inflate the bill.

Why teams care

After a model migration, the questions that reach an engineer are blunt: why did cost move, why did latency change, why are we seeing more throttling? Reasoning effort sits underneath all three. Left high by default, it quietly spends tokens on internal computation that never reaches the reader — real money on pay-as-you-go, and scarce headroom on provisioned capacity. Knowing where effort actually changes the answer is the difference between a confident rollout and a surprise on the invoice.

What we tested

We held the prompts and workload slices fixed and walked the same reasoning-effort ladder across each one — short factual answers, structured-to-natural-language synthesis, multi-step reasoning, and tool-using agents — capturing the full usage breakdown on every call: input, cached, reasoning, and output tokens, alongside a quality signal and latency. On top of that measured token shape we layered a modeled view of how the same workload would price under pay-as-you-go versus provisioned throughput.
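The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's harness: the effort level names, the slice identifiers, the `CallRecord` fields, and the `fake_call` stub are all hypothetical, and a real run would inject a function that calls the deployment.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Sequence

# Hypothetical labels: the source does not name its effort levels or slice IDs.
EFFORT_LADDER = ["low", "medium", "high"]
SLICES = ["short_factual", "structured_to_nl", "multi_step", "tool_agent"]


@dataclass
class CallRecord:
    slice_name: str
    effort: str
    input_tokens: int
    cached_tokens: int
    reasoning_tokens: int
    output_tokens: int
    quality: float
    latency_s: float


def walk_ladder(
    run_call: Callable[[str, str], Dict[str, float]],
    slices: Sequence[str] = SLICES,
    ladder: Sequence[str] = EFFORT_LADDER,
) -> List[CallRecord]:
    """Run every (slice, effort) pair once and keep one record per call."""
    records = []
    for slice_name, effort in product(slices, ladder):
        metrics = run_call(slice_name, effort)  # injected: real client or stub
        records.append(CallRecord(slice_name=slice_name, effort=effort, **metrics))
    return records


def fake_call(slice_name: str, effort: str) -> Dict[str, float]:
    """Stub standing in for a real model call; all numbers are made up."""
    reasoning = {"low": 100, "medium": 400, "high": 1600}[effort]
    return {
        "input_tokens": 500,
        "cached_tokens": 0,
        "reasoning_tokens": reasoning,
        "output_tokens": 120,
        "quality": 0.0,
        "latency_s": 0.0,
    }


records = walk_ladder(fake_call)
print(len(records))  # 4 slices x 3 effort levels = 12 records
```

Keeping the call itself behind an injected function means the prompts and slices stay fixed while only the effort level varies, which is the comparison the study depends on.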

What the evidence showed

The results split the way good evidence should — effort paid for itself in some places and only bought hidden computation in others. Four takeaways carry across the slices.

Always measure reasoning tokens

Reasoning tokens are billed but never returned in the response. Capture the full usage breakdown (input, cached, reasoning, and output) on every call.
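As a rough sketch of what capturing the breakdown might look like, the function below splits a usage payload into the four buckets. It assumes a chat-completions-style usage object in which cached tokens are counted inside `prompt_tokens` and reasoning tokens inside `completion_tokens`; verify the field names and nesting against the API version you call. The example numbers are illustrative, not measurements from this project.

```python
from typing import Any, Dict


def usage_breakdown(usage: Dict[str, Any]) -> Dict[str, int]:
    """Split a usage payload into uncached input, cached, reasoning, visible output.

    Assumption: cached tokens are a subset of prompt_tokens and reasoning
    tokens are a subset of completion_tokens. Missing detail fields count as 0.
    """
    prompt = int(usage.get("prompt_tokens", 0))
    completion = int(usage.get("completion_tokens", 0))
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    cached = int(prompt_details.get("cached_tokens") or 0)
    reasoning = int(completion_details.get("reasoning_tokens") or 0)
    return {
        "input_uncached": prompt - cached,
        "cached": cached,
        "reasoning": reasoning,
        "visible_output": completion - reasoning,
    }


# Illustrative payload (made-up numbers).
example_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 900,
    "prompt_tokens_details": {"cached_tokens": 1024},
    "completion_tokens_details": {"reasoning_tokens": 768},
}

breakdown = usage_breakdown(example_usage)
print(breakdown)
```

In this made-up payload, 768 of the 900 completion tokens are reasoning, so only 132 tokens ever reach the reader. Logging that split per call is what makes the hidden share visible.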

Not every task needs reasoning

Short factual answers, structured-to-natural-language synthesis, and simple classification often saw cost rise without a matching quality gain.

Effort is the primary cost lever

Default the reasoning-effort knob to its lowest level and raise it only when a quality evaluation justifies the spend.
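One way to read this rule is "pick the lowest rung that clears the quality bar." The sketch below implements that reading; the level names, the scores, and the threshold are hypothetical, and the project's own decision framework may weigh cost and quality differently.

```python
from typing import Dict, Optional, Sequence


def lowest_passing_effort(
    quality_by_effort: Dict[str, float],
    ladder: Sequence[str],
    min_quality: float,
) -> Optional[str]:
    """Return the first effort level, walking up the ladder, whose measured
    quality meets the threshold. Return None if no level passes."""
    for effort in ladder:  # ladder is ordered from lowest to highest effort
        score = quality_by_effort.get(effort)
        if score is not None and score >= min_quality:
            return effort
    return None


# Hypothetical evaluation scores for one workload slice.
ladder = ["low", "medium", "high"]
scores = {"low": 0.70, "medium": 0.90, "high": 0.92}

choice = lowest_passing_effort(scores, ladder, min_quality=0.85)
print(choice)  # medium
```

With these made-up scores, "high" adds little over "medium", so the extra reasoning spend is not justified for this slice. A `None` result signals that no effort level meets the bar and the problem lies elsewhere (prompt, model, or threshold).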

Billing model changes the payoff

On pay-as-you-go, cutting reasoning tokens lowers the bill. On provisioned throughput, the same cut is a throughput gain at a fixed bill.
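The contrast can be made concrete with a small model. Everything numeric here is a placeholder: the prices are not real rates, the capacity figure is invented, and treating provisioned capacity as a flat tokens-per-minute budget is a simplification (real provisioned accounting may weight input and output tokens differently). The separate `reasoning_per_1m` rate is also an assumption; set it equal to the output rate if that is how your deployment bills.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical per-million-token rates, in any currency."""
    input_per_1m: float
    cached_per_1m: float
    reasoning_per_1m: float
    output_per_1m: float


def payg_cost_per_call(uncached: int, cached: int, reasoning: int,
                       output: int, p: Prices) -> float:
    """Pay-as-you-go: every token bucket is charged at its own rate."""
    total = (uncached * p.input_per_1m + cached * p.cached_per_1m
             + reasoning * p.reasoning_per_1m + output * p.output_per_1m)
    return total / 1_000_000


def calls_per_minute(capacity_tokens_per_min: float, tokens_per_call: int) -> float:
    """Provisioned (simplified): fixed capacity divided by tokens per call."""
    return capacity_tokens_per_min / tokens_per_call


prices = Prices(input_per_1m=2.0, cached_per_1m=0.5,
                reasoning_per_1m=8.0, output_per_1m=8.0)

# Same call shape before and after cutting reasoning from 2000 to 500 tokens.
cost_before = payg_cost_per_call(1000, 1000, 2000, 500, prices)
cost_after = payg_cost_per_call(1000, 1000, 500, 500, prices)

# Same cut under a fixed, made-up capacity of 90,000 tokens per minute.
rate_before = calls_per_minute(90_000, 1000 + 1000 + 2000 + 500)
rate_after = calls_per_minute(90_000, 1000 + 1000 + 500 + 500)

print(cost_before, cost_after)
print(rate_before, rate_after)
```

Under these placeholder numbers the pay-as-you-go cost per call drops by more than half, while the provisioned bill stays the same and the deployment fits 30 calls per minute instead of 20.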

Who this is for

Engineers and architects running an Azure OpenAI deployment who are deciding whether to enable reasoning and at what effort level, sizing capacity, or debugging a cost, latency, or throttling change after a model migration.

Where to go next

Read the overview essay

The guided story: where effort wastes spend, where it earns a trial, and why hidden tokens must be budgeted explicitly.

Browse the articles hub

The full reading arc: evidence topics, the bridge essay, and the operations notes on recovery, caching, and sizing.

Inspect the evidence dashboard

The audit trail: governed tables and rendered source charts behind every claim.

Read the full methodology

The decision framework and operator guidance live in the repository documentation on GitHub.

Glossary

Foundry: Azure AI Foundry — the managed platform used to deploy and serve the models evaluated in this project.
PTU: Provisioned Throughput Units — reserved, fixed-capacity billing where you pay for throughput rather than per token.
PAYG: Pay-as-you-go — consumption billing charged per input, cached, reasoning, and output token.
Reasoning tokens: Internal "thinking" tokens a reasoning model generates and bills but never returns in the response.
Cache: Prompt cache — reuse of previously processed prompt prefixes, billed at a reduced cached-input rate.
429: HTTP 429 Too Many Requests — the throttling response returned when a deployment exceeds its rate or capacity limit.