when reasoning pays off

Operations essay · GPT-4o to GPT-5.x sizing

Do not resize a reasoning migration from the model name alone

A team plans to move its gpt-4o traffic to a GPT-5.x reasoning model and needs a PTU number for the new deployment. The reflex is to reuse what already works: carry the old gpt-4o PTU size forward, or apply a family-level “5.x” multiplier to it, on the assumption that the model name carries enough information to resize. But PTU sizing is driven by token shape, not by the model name: prompt size, visible output, hidden reasoning tokens, cache hit rate, and response ceiling all move the result. That is the operating question this essay asks: is the model name enough, or does the request shape have to be re-measured first?

This essay at a glance. Migration sizing is modeled arithmetic, not a model-name lookup: recommended PTU = ceil(demand ÷ density ÷ target utilization). This repository treats an oversized response ceiling (max_output_tokens) as an admission-risk hypothesis to test, not a documented platform behavior: the open question is whether the ceiling could dominate the PTU the workload reserves. The one-pager carries that single message and marks the boundary plainly: it is a synthetic worked example over a modeled workload with pinned pricing snapshots, not measured capacity.
Figure: one-pager for reasoning migration sizing. It traces the sizing chain from demand TPM through the prompt plus max_output reservation, divided by density and target utilization, to a recommended PTU ceiling, and flags an oversized response ceiling (max_output_tokens) as a sizing risk to verify rather than a documented platform behavior. Headline message: migration sizing is modeled arithmetic. Marked explicitly as a synthetic worked example, not measured capacity.

Why re-measure instead of reusing the old size

The reuse reflex assumes the model name carries the size — that a working gpt-4o number, scaled by a family multiplier, will still fit GPT-5.x. Before accepting that, two things are worth putting side by side. The first is what the documentation specifies: Microsoft Learn converts traffic into normalized TPM from peak RPM, average prompt and output tokens, cache rate, the model's output-to-input ratio, and its Input TPM per PTU, and it deducts cached prompt tokens from utilization; OpenAI documents reasoning tokens as non-visible output-side tokens reported in output_tokens_details.reasoning_tokens. The second is this repository's illustrative worked calculation, which runs that documented formula over one example call shape so the size drivers sit side by side — not a measurement of any deployment.

The pairing is the part worth pausing on. Read on visible output alone, the migration can look manageable: the same prompt and the same visible answer can size at roughly the same level, or even slightly lighter, on the newer model. But the visible answer is not the whole request. Once hidden reasoning output appears, the same answer can need far more normalized TPM. This repository's working hypothesis, an admission risk to test and not a documented platform behavior, is that an oversized max_output_tokens ceiling could also reserve admission weight whether or not those tokens are ever generated. That is why the operating question is not “is the new model heavier?” but “what does the measured request shape (visible output, reasoning output, cache rate, and response ceiling) actually reserve?”

The normalized-TPM formula, the per-model table, the cache deduction, and the reasoning-token accounting are what Microsoft Learn and OpenAI document; treating an oversized max_output_tokens as a PTU admission risk to test is operational inference this repository defends. The chart later in this essay is a synthetic worked example over a modeled workload with pinned pricing snapshots, not measured capacity, live PTU throughput, or an SLA. Its purpose is to make the size drivers explicit.

Question

What must be re-measured before a GPT-4o workload is sized for GPT-5.x on PTU?

Evidence

Microsoft Learn gives the normalized-TPM formula; OpenAI documents reasoning-token accounting.

Decision

Size from measured token shape, not from family-level assumptions about “5.x”.

The sizing formula is about token shape

Microsoft Learn sizes PTU demand by converting traffic into normalized TPM. The inputs are peak RPM, average prompt tokens, average generated output tokens, cache rate, the model's output-to-input ratio, and the model's Input TPM per PTU value.

Formula:

normalized_tpm = rpm × (prompt_tokens × (1 - cache_rate) + output_ratio × output_tokens)

raw_ptu = normalized_tpm ÷ input_tpm_per_ptu

That formula is why a family-level shortcut can mislead. A model can have a heavier output ratio and still have a higher Input TPM per PTU than the previous model. The answer depends on your workload's actual prompt, output, and cache profile.
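A minimal Python sketch of the two formulas above can make that point concrete. The two model profiles are hypothetical placeholders chosen for illustration, not values from the Microsoft Learn model table, and the function names are this sketch's own.

```python
def normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, output_ratio):
    """Normalized TPM: cached prompt tokens are deducted, output tokens are weighted."""
    return rpm * (prompt_tokens * (1 - cache_rate) + output_ratio * output_tokens)


def raw_ptu(rpm, prompt_tokens, output_tokens, cache_rate,
            output_ratio, input_tpm_per_ptu):
    """Raw (unrounded) PTU demand for one workload shape on one model."""
    tpm = normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, output_ratio)
    return tpm / input_tpm_per_ptu


# HYPOTHETICAL model profiles, not taken from the Microsoft Learn table.
MODEL_A = {"output_ratio": 4.0, "input_tpm_per_ptu": 2500}
MODEL_B = {"output_ratio": 8.0, "input_tpm_per_ptu": 5000}  # heavier output weight

# One example call shape: 1,000 RPM, 200 prompt tokens, 20 output tokens, no cache.
shape = dict(rpm=1000, prompt_tokens=200, output_tokens=20, cache_rate=0.0)

ptu_a = raw_ptu(**shape, **MODEL_A)
ptu_b = raw_ptu(**shape, **MODEL_B)
print(f"model A: {ptu_a:.1f} raw PTU, model B: {ptu_b:.1f} raw PTU")
```

Under these made-up numbers, model B weights output twice as heavily yet needs fewer raw PTUs for this shape, because its Input TPM per PTU is also higher. A longer output or a different cache rate could reverse the ordering, which is why the shape has to be measured.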

Four drivers decide whether the migration fits

1. Output weighting

For GPT-4.1 and later, Microsoft Learn says output tokens carry a model-specific input-token weight for utilization.

2. Hidden reasoning

OpenAI documents reasoning tokens as non-visible output-side tokens. Measure them from output_tokens_details.reasoning_tokens.

3. Cache rate

Microsoft Learn says cached prompt tokens are deducted from PTU utilization. A prefix-shape change can therefore move the sizing result.

4. Response ceiling

Record max_output_tokens for every call. This repository treats an oversized ceiling as a PTU admission-risk hypothesis to test, so the ceiling should be measured rather than guessed.
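The four drivers can be captured per call with a small helper, sketched here in Python. The essay names only output_tokens_details.reasoning_tokens; the other field names (input_tokens, input_tokens_details.cached_tokens, output_tokens) are assumptions modeled on a Responses-style usage payload, and the sketch also assumes that output_tokens already includes the reasoning tokens. Verify both against the API version in use.

```python
def token_shape(usage: dict, max_output_tokens: int) -> dict:
    """Extract the per-call size drivers from a usage payload (assumed layout)."""
    out_details = usage.get("output_tokens_details") or {}
    in_details = usage.get("input_tokens_details") or {}

    prompt = usage["input_tokens"]
    output = usage["output_tokens"]
    reasoning = out_details.get("reasoning_tokens", 0)
    cached = in_details.get("cached_tokens", 0)

    return {
        "prompt_tokens": prompt,
        # Assumption: output_tokens counts visible plus reasoning tokens.
        "visible_output_tokens": output - reasoning,
        "reasoning_tokens": reasoning,
        "cached_tokens": cached,
        "cache_rate": cached / prompt if prompt else 0.0,
        # The ceiling is a request parameter, so the caller passes it in.
        "max_output_tokens": max_output_tokens,
    }


# Hypothetical usage payload for a single call.
example_usage = {
    "input_tokens": 200,
    "input_tokens_details": {"cached_tokens": 100},
    "output_tokens": 80,
    "output_tokens_details": {"reasoning_tokens": 60},
}
shape = token_shape(example_usage, max_output_tokens=4096)
print(shape)
```

Logging this record for every call keeps the ceiling next to the tokens actually generated, so an oversized max_output_tokens shows up as a gap in the data instead of an assumption.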

Hidden output can change the answer

Chart to read: a 1,000 RPM example with 200 input tokens and 20 visible output tokens. The first three bars use the official normalized-TPM formula and model table. The final two bars add a 60-token reasoning assumption to show why invisible output must be measured.
Figure: illustrative PTU sizing chart comparing GPT-4o and GPT-5.2 migration drivers.

Illustrative sizing rows

Row                         | Model   | Cache rate | Generated output assumption      | Rounded PTUs
Visible baseline            | gpt-4o  | 0%         | 20 output tokens                 | 115
Visible migration           | gpt-5.2 | 0%         | 20 output tokens                 | 110
Visible migration + cache   | gpt-5.2 | 50%        | 20 output tokens                 | 80
Reasoning migration         | gpt-5.2 | 0%         | 20 visible + 60 reasoning tokens | 250
Reasoning migration + cache | gpt-5.2 | 50%        | 20 visible + 60 reasoning tokens | 220

What this means

The visible-only rows do not support a blanket claim that GPT-5.x is always less efficient than GPT-4o. The reasoning rows show the trap: once hidden reasoning output appears, the same visible answer can need much more normalized TPM unless cache and response policy are measured and tuned.
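The reasoning trap can be isolated with a short Python sketch that holds the visible answer fixed and adds hidden output. It follows the chart's assumption that reasoning tokens are weighted like any other output token. The output ratio of 4.0 is a hypothetical placeholder, not a Microsoft Learn value, so the results will not match the rounded PTUs in the table above.

```python
def normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, output_ratio):
    """Normalized TPM per the documented formula."""
    return rpm * (prompt_tokens * (1 - cache_rate) + output_ratio * output_tokens)


OUTPUT_RATIO = 4.0  # HYPOTHETICAL output-to-input weight, for illustration only
RPM, PROMPT, VISIBLE, REASONING = 1000, 200, 20, 60

visible_only = normalized_tpm(RPM, PROMPT, VISIBLE, 0.0, OUTPUT_RATIO)
with_reasoning = normalized_tpm(RPM, PROMPT, VISIBLE + REASONING, 0.0, OUTPUT_RATIO)

# A 50% cache rate discounts only the prompt term; the output term is untouched.
with_reasoning_cached = normalized_tpm(RPM, PROMPT, VISIBLE + REASONING, 0.5, OUTPUT_RATIO)

growth = with_reasoning / visible_only
print(f"visible only:       {visible_only:,.0f} normalized TPM")
print(f"with reasoning:     {with_reasoning:,.0f} normalized TPM ({growth:.2f}x)")
print(f"reasoning + cache:  {with_reasoning_cached:,.0f} normalized TPM")
```

Because both cases divide by the same Input TPM per PTU, the growth factor carries straight through to PTU demand. The cache row shows why caching softens the increase without removing it: the deduction applies to prompt tokens only.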

Run the migration as a measurement, not a rename

  1. Pin the task mix. Keep prompt, tools, schema, retrieval placement, and sample distribution byte-stable across model cells.
  2. Measure visible output. Compare mean and p95 visible output tokens before and after migration.
  3. Measure reasoning output. Capture reasoning_tokens; do not infer it from the final answer text.
  4. Measure cache rate. Record cached_tokens / prompt_tokens per cache-key bucket, especially after ReAct or tool-schema changes.
  5. Right-size the ceiling. Set max_output_tokens from a representative visible-plus-reasoning percentile, then use per-call overrides for rare long answers.
  6. Only then size PTU. Apply the normalized-TPM formula with the current model table and your measured workload shape.
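The steps above can be sketched as a small aggregation in Python. The record fields, the nearest-rank percentile, the model values (output ratio 4.0, Input TPM per PTU 3000), and the 80% target utilization are all hypothetical choices made for this sketch; the final step applies the normalized-TPM formula and the ceil(demand ÷ density ÷ target utilization) chain from the summary.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def measured_shape(records):
    """Aggregate per-call token records into the shape used for sizing."""
    prompts = [r["prompt_tokens"] for r in records]
    generated = [r["visible_output_tokens"] + r["reasoning_tokens"] for r in records]
    return {
        "mean_prompt": statistics.fmean(prompts),
        "mean_generated": statistics.fmean(generated),
        # Candidate max_output_tokens: a visible-plus-reasoning percentile.
        "p95_generated": percentile(generated, 95),
        "cache_rate": sum(r["cached_tokens"] for r in records) / sum(prompts),
    }


def recommended_ptu(rpm, shape, output_ratio, input_tpm_per_ptu, target_utilization):
    """ceil(normalized TPM / Input TPM per PTU / target utilization)."""
    normalized = rpm * (shape["mean_prompt"] * (1 - shape["cache_rate"])
                        + output_ratio * shape["mean_generated"])
    return math.ceil(normalized / input_tpm_per_ptu / target_utilization)


# Hypothetical measurement records, one per call.
records = [
    {"prompt_tokens": 200, "cached_tokens": 100, "visible_output_tokens": 20, "reasoning_tokens": 60},
    {"prompt_tokens": 200, "cached_tokens": 100, "visible_output_tokens": 20, "reasoning_tokens": 40},
    {"prompt_tokens": 200, "cached_tokens": 0, "visible_output_tokens": 30, "reasoning_tokens": 70},
    {"prompt_tokens": 200, "cached_tokens": 0, "visible_output_tokens": 10, "reasoning_tokens": 70},
]

shape = measured_shape(records)
ptu = recommended_ptu(1000, shape, output_ratio=4.0,
                      input_tpm_per_ptu=3000, target_utilization=0.8)
print(shape, ptu)
```

A real run would replace the four records with a representative sample per cache-key bucket and take the model values from the current Microsoft Learn table on the day of sizing.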

Sources and evidence boundary

Tier 1 — service contract (Microsoft Learn and OpenAI). The normalized-TPM sizing formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, the 100% cache deduction from utilization, and the reasoning-token accounting are documented here.

Tier 2 — operational inference (this repository). The treatment of an oversized max_output_tokens as a PTU admission-risk hypothesis, and the measurement contract for reasoning and cached tokens, are repository operational inference, not Learn specifications.

What this topic does and does not prove. It documents the PTU sizing contract as Microsoft Learn states it on the access date — the normalized-TPM formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, and the 100% cache deduction — plus OpenAI's reasoning-token accounting and a repository rule of thumb: re-measure output, reasoning, cache, and ceiling before sizing. The illustrative chart applies the documented formula to one example call shape; it is a worked calculation, not a capacity guarantee, and it does not prove that any specific GPT-5.x workload is heavier or lighter than its GPT-4o predecessor across every region, model version, or traffic shape.

The practical rule

Before moving GPT-4o traffic to GPT-5.x on PTU, re-measure output, reasoning, cache, and the max_output_tokens ceiling policy, then size with the normalized-TPM formula. The model name alone is not enough information to size the workload.

This essay closes the operations arc; the loop returns to the methodology that decided which evidence was strong enough to publish in the first place.

How did the lab decide what evidence was strong enough to publish?