Read the overview essay
The guided story: where effort wastes spend, where it earns a trial, and why hidden tokens must be budgeted explicitly. Start reading →
A team migrates a workload onto a reasoning model, sees a familiar per-token price, and finds a new dial: reasoning effort. Turning it up feels almost free — so why not buy better answers everywhere? Then the bill climbs, even though the visible output barely changed. The reason is hidden "thinking" tokens: a reasoning model generates and bills them, but never shows them in the response. This project measures, on a small set of representative tasks, when those extra tokens earn their cost — and when they only inflate the bill.
After a model migration, the questions that reach an engineer are blunt: why did cost move, why did latency change, why are we seeing more throttling? Reasoning effort sits underneath all three. Left high by default, it quietly spends tokens on internal computation that never reaches the reader — real money on pay-as-you-go, and scarce headroom on provisioned capacity. Knowing where effort actually changes the answer is the difference between a confident rollout and a surprise on the invoice.
We held the prompts and workload slices fixed and walked the same reasoning-effort ladder across each one — short factual answers, structured-to-natural-language synthesis, multi-step reasoning, and tool-using agents — capturing the full usage breakdown on every call: input, cached, reasoning, and output tokens, alongside a quality signal and latency. On top of that measured token shape we layered a modeled view of how the same workload would price under pay-as-you-go versus provisioned throughput.
The results split the way good evidence should — effort paid for itself in some places and only bought hidden computation in others. Four takeaways carry across the slices.
They are billed but never returned. Capture the full usage breakdown — input, cached, reasoning, and output — on every call.
Short factual answers, structured-to-natural-language synthesis, and simple classification often saw cost rise without a matching quality gain.
Default the reasoning-effort knob to its lowest level and raise it only when a quality evaluation justifies the spend.
On pay-as-you-go, cutting reasoning tokens lowers the bill. On provisioned throughput, the same cut is a throughput gain at a fixed bill.
Engineers and architects running an Azure OpenAI deployment who are deciding whether (and how much) reasoning to enable, sizing capacity, or debugging a cost, latency, or throttling change after a model migration.
The guided story: where effort wastes spend, where it earns a trial, and why hidden tokens must be budgeted explicitly. Start reading →
The full reading arc — evidence topics, the bridge essay, and the operations notes on recovery, caching, and sizing. Open the hub →
The audit trail: governed tables and rendered source charts behind every claim. Open the dashboard →
The decision framework and operator guidance live in the repository documentation on GitHub →