Operations essay · GPT-4o to GPT-5.x sizing
Do not resize a reasoning migration from the model name alone
A team plans to move its gpt-4o traffic to a GPT-5.x reasoning
model and needs a PTU number for the new deployment. The reflex is to reuse
what already works: carry the old gpt-4o PTU size forward, or
apply a family-level “5.x” multiplier to it, on the assumption that the
model name carries enough information to resize. But PTU sizing is driven by
token shape, not by the model name — the prompt size, visible output,
hidden reasoning tokens, cache hit rate, and response ceiling all move the
result. So the instinct to size from the name alone runs straight into the
operating question this essay asks — is the model name enough, or does
the request shape have to be re-measured first?
An oversized response ceiling (max_output_tokens) is treated
in this repository as an admission-risk hypothesis to test, not a
documented platform behavior, for whether it could dominate the PTU
the workload reserves. The one-pager
carries that single message and marks the boundary plainly: it is a
synthetic worked example over a modeled workload with pinned pricing
snapshots, not measured capacity.
Open the migration-sizing one-pager SVG.
Why re-measure instead of reusing the old size
The reuse reflex assumes the model name carries the size — that a
working gpt-4o number, scaled by a family multiplier, will
still fit GPT-5.x. Before accepting that, two things are worth putting
side by side. The first is what the documentation specifies: Microsoft
Learn converts traffic into normalized TPM from peak RPM, average prompt
and output tokens, cache rate, the model's output-to-input ratio, and its
Input TPM per PTU, and it deducts cached prompt tokens from utilization;
OpenAI documents reasoning tokens as non-visible output-side tokens
reported in output_tokens_details.reasoning_tokens. The
second is this repository's illustrative worked calculation, which runs
that documented formula over one example call shape so the size drivers
sit side by side — not a measurement of any deployment.
The pairing is the part worth pausing on. Read on visible output alone,
the migration can look manageable — the same prompt and the same
visible answer can size at roughly the same level, or even slightly
lighter, on the newer model. But the visible answer is not the whole
request: once hidden reasoning output appears, the same answer can need
far more normalized TPM, and this repository's working hypothesis —
an admission risk to test, not a documented platform behavior — is
that an oversized max_output_tokens ceiling could reserve
admission weight whether or not those tokens are ever generated. That is
why the operating question is not “is the new model
heavier?” but “what does the measured request shape — visible
output, reasoning output, cache rate, and response ceiling —
actually reserve?” The normalized-TPM formula, the per-model table, the
cache deduction, and the reasoning-token accounting are what Microsoft
Learn and OpenAI document; treating an oversized
max_output_tokens as PTU admission risk to test is
operational inference this repository defends. The chart that follows is a
synthetic worked example over a modeled workload with pinned pricing
snapshots — not measured capacity, live PTU throughput, or an SLA.
The illustrative chart below makes the size drivers explicit.
Question
What must be re-measured before a GPT-4o workload is sized for GPT-5.x on PTU?
Evidence
Microsoft Learn gives the normalized-TPM formula; OpenAI documents reasoning-token accounting.
Decision
Size from measured token shape, not from family-level assumptions about “5.x”.
Official behavior · normalized TPM
The sizing formula is about token shape
Microsoft Learn sizes PTU demand by converting traffic into normalized TPM. The inputs are peak RPM, average prompt tokens, average generated output tokens, cache rate, the model's output-to-input ratio, and the model's Input TPM per PTU value.
Formula:
normalized_tpm = rpm × (prompt_tokens × (1 - cache_rate) + output_ratio × output_tokens)
raw_ptu = normalized_tpm ÷ input_tpm_per_ptu
That formula is why a family-level shortcut can mislead. A model can have a heavier output ratio and still have a higher Input TPM per PTU than the previous model. The answer depends on your workload's actual prompt, output, and cache profile.
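As a minimal sketch, the formula can be written as two small Python functions. The function names and every number in the example call are placeholders chosen for round arithmetic, not values from the Microsoft Learn model table; substitute the current per-model output-to-input ratio and Input TPM per PTU, along with your own measured workload shape.

```python
def normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, output_ratio):
    """Normalized TPM for one workload shape, following the formula above.

    output_tokens is the average generated output per call; for a reasoning
    model that means visible plus reasoning tokens.
    """
    if not 0.0 <= cache_rate <= 1.0:
        raise ValueError("cache_rate must be between 0 and 1")
    uncached_prompt = prompt_tokens * (1.0 - cache_rate)
    weighted_output = output_ratio * output_tokens
    return rpm * (uncached_prompt + weighted_output)


def raw_ptu(tpm, input_tpm_per_ptu):
    """Raw (unrounded) PTU demand for a given normalized TPM."""
    if input_tpm_per_ptu <= 0:
        raise ValueError("input_tpm_per_ptu must be positive")
    return tpm / input_tpm_per_ptu


# PLACEHOLDER inputs, chosen only so the arithmetic is easy to follow.
# They are not taken from the Microsoft Learn model table.
tpm = normalized_tpm(
    rpm=100,
    prompt_tokens=1000,
    output_tokens=50,
    cache_rate=0.5,
    output_ratio=4.0,
)
print(tpm)                                      # 100 * (500 + 200) = 70000.0
print(raw_ptu(tpm, input_tpm_per_ptu=2000.0))   # 35.0
```

The raw figure still has to be rounded to whatever deployment sizes the platform allows for the model, a step this sketch leaves out.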
Operating lens · what changes in migration
Four drivers decide whether the migration fits
1. Output weighting
For GPT-4.1 and later, Microsoft Learn says output tokens carry a model-specific input-token weight for utilization.
2. Hidden reasoning
OpenAI documents reasoning tokens as non-visible output-side tokens. Measure them from output_tokens_details.reasoning_tokens.
3. Cache rate
Microsoft Learn says cached prompt tokens are deducted from PTU utilization. A prefix-shape change can therefore move the sizing result.
4. Response ceiling
Record max_output_tokens. This repo treats oversized ceilings as a PTU admission-risk hypothesis to test, not a value to guess.
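The four drivers can be captured per call with a small helper. This is a hedged sketch, not the repository's record contract: `token_shape` is a hypothetical function, and it assumes the usage payload is a plain dict shaped like an OpenAI Responses API usage object (`input_tokens`, `input_tokens_details.cached_tokens`, `output_tokens`, `output_tokens_details.reasoning_tokens`) in which `output_tokens` already includes the reasoning tokens. Verify both assumptions against the payloads your deployment actually returns.

```python
from typing import Optional


def token_shape(usage: dict, max_output_tokens_sent: Optional[int] = None) -> dict:
    """Extract the sizing drivers from one usage payload (assumed shape)."""
    input_details = usage.get("input_tokens_details") or {}
    output_details = usage.get("output_tokens_details") or {}

    prompt = usage.get("input_tokens", 0)
    cached = input_details.get("cached_tokens", 0)
    generated = usage.get("output_tokens", 0)
    reasoning = output_details.get("reasoning_tokens", 0)

    return {
        "prompt_tokens": prompt,
        "cached_tokens": cached,
        "cache_rate": cached / prompt if prompt else 0.0,
        # Assumption: output_tokens counts visible and reasoning tokens together.
        "generated_output_tokens": generated,
        "reasoning_tokens": reasoning,
        "visible_output_tokens": generated - reasoning,
        # The ceiling is a request parameter, so the caller has to pass it in.
        "max_output_tokens_sent": max_output_tokens_sent,
    }


# Made-up example payload for illustration only.
example_usage = {
    "input_tokens": 1000,
    "input_tokens_details": {"cached_tokens": 500},
    "output_tokens": 80,
    "output_tokens_details": {"reasoning_tokens": 60},
}
shape = token_shape(example_usage, max_output_tokens_sent=4096)
print(shape["visible_output_tokens"], shape["reasoning_tokens"], shape["cache_rate"])
```

Keeping the ceiling next to the measured tokens in one record makes it possible to compare what was reserved with what was generated when the admission-risk hypothesis is tested.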
Illustrative calculation · same visible workload
Hidden output can change the answer
| Row | Model | Cache rate | Generated output assumption | Rounded PTUs |
|---|---|---|---|---|
| Visible baseline | gpt-4o | 0% | 20 output tokens | 115 |
| Visible migration | gpt-5.2 | 0% | 20 output tokens | 110 |
| Visible migration + cache | gpt-5.2 | 50% | 20 output tokens | 80 |
| Reasoning migration | gpt-5.2 | 0% | 20 visible + 60 reasoning tokens | 250 |
| Reasoning migration + cache | gpt-5.2 | 50% | 20 visible + 60 reasoning tokens | 220 |
What this means
The visible-only rows do not support a blanket claim that GPT-5.x is always less efficient than GPT-4o. The reasoning rows show the trap: once hidden reasoning output appears, the same visible answer can need much more normalized TPM unless cache and response policy are measured and tuned.
Operator checklist · re-measure before buying capacity
Run the migration as a measurement, not a rename
- Pin the task mix. Keep prompt, tools, schema, retrieval placement, and sample distribution byte-stable across model cells.
- Measure visible output. Compare mean and p95 visible output tokens before and after migration.
- Measure reasoning output. Capture reasoning_tokens; do not infer it from the final answer text.
- Measure cache rate. Record cached_tokens / prompt_tokens per cache-key bucket, especially after ReAct or tool-schema changes.
- Right-size the ceiling. Set max_output_tokens from a representative visible-plus-reasoning percentile, then use per-call overrides for rare long answers.
- Only then size PTU. Apply the normalized-TPM formula with the current model table and your measured workload shape.
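Two of the checklist steps, the per-bucket cache rate and the percentile-based ceiling, can be sketched over a list of per-call records. The record keys (`bucket`, `prompt_tokens`, `cached_tokens`, `visible_output_tokens`, `reasoning_tokens`), the function names, and the nearest-rank percentile method are assumptions made for illustration; they are not the repository's schema or a prescribed method.

```python
import math
from collections import defaultdict


def cache_rate_by_bucket(records):
    """cached_tokens / prompt_tokens, aggregated per cache-key bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [cached, prompt]
    for r in records:
        totals[r["bucket"]][0] += r["cached_tokens"]
        totals[r["bucket"]][1] += r["prompt_tokens"]
    return {b: (c / p if p else 0.0) for b, (c, p) in totals.items()}


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ceiling_from_percentile(records, pct=95):
    """Candidate max_output_tokens from visible-plus-reasoning output."""
    generated = [r["visible_output_tokens"] + r["reasoning_tokens"] for r in records]
    return nearest_rank_percentile(generated, pct)


# Made-up sample records for illustration only.
sample = [
    {"bucket": "a", "prompt_tokens": 1000, "cached_tokens": 500,
     "visible_output_tokens": 20, "reasoning_tokens": 60},
    {"bucket": "a", "prompt_tokens": 1000, "cached_tokens": 0,
     "visible_output_tokens": 30, "reasoning_tokens": 100},
    {"bucket": "b", "prompt_tokens": 500, "cached_tokens": 0,
     "visible_output_tokens": 10, "reasoning_tokens": 0},
]
print(cache_rate_by_bucket(sample))
print(ceiling_from_percentile(sample, pct=95))
```

Calls that legitimately exceed the chosen percentile are the ones the checklist routes to per-call overrides, so the default ceiling does not have to cover the rare long answer.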
Sources and evidence boundary
Tier 1 — service contract (Microsoft Learn and OpenAI). The normalized-TPM sizing formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, the 100% cache deduction from utilization, and the reasoning-token accounting are documented here.
- [1] Determine PTU sizing for a workload — defines the normalized-TPM calculation (Input TPM = Peak RPM × prompt size; Output TPM = Peak RPM × response size; Normalized TPM = input TPM × (1 − cache rate) + output-to-input ratio × output TPM; PTUs = normalized TPM ÷ Input TPM per PTU), the per-model Input TPM per PTU and output-to-input ratio values, and that cached tokens are deducted 100% from the utilization calculation. accessed 2026-06-05: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-throughput-sizing; https://web.archive.org/web/20260605011106/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-throughput-sizing.
- [2] What is provisioned throughput for Foundry Models? — documents the PTU utilization model and how request shape consumes reserved capacity. accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput; https://web.archive.org/web/20260426163819/https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput.
- [3] Prompt caching with Azure OpenAI — documents that cached prompt tokens are deducted from utilization, so a prefix-shape change moves the sizing result. accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching; https://web.archive.org/web/20260604174202/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching.
- [4] OpenAI reasoning models guide — documents reasoning
tokens as non-visible output-side tokens reported in
output_tokens_details.reasoning_tokens. accessed 2026-06-04: https://developers.openai.com/api/docs/guides/reasoning; https://web.archive.org/web/20260602053729/https://developers.openai.com/api/docs/guides/reasoning.
Tier 2 — operational inference (this repository). The
treatment of an oversized max_output_tokens as a PTU
admission-risk hypothesis, and the measurement contract for reasoning and
cached tokens, are repository operational inference, not Learn
specifications.
- [5] this repository, docs/13-ptu-vs-payg-decision-runbook.md: records mean_max_output_tokens and interprets a large max_output_tokens as reserved admission weight to test, not a value to guess.
- [6] this repository, docs/14-observability-schema.md: the canonical PTU record contract mapping reasoning_tokens, cached_tokens, and max_output_tokens_sent to their source usage fields.
What this topic does and does not prove. It documents the PTU sizing contract as Microsoft Learn states it on the access date — the normalized-TPM formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, and the 100% cache deduction — plus OpenAI's reasoning-token accounting and a repository rule of thumb: re-measure output, reasoning, cache, and ceiling before sizing. The illustrative chart applies the documented formula to one example call shape; it is a worked calculation, not a capacity guarantee, and it does not prove that any specific GPT-5.x workload is heavier or lighter than its GPT-4o predecessor across every region, model version, or traffic shape.
The practical rule
Before moving GPT-4o traffic to GPT-5.x on PTU, re-measure visible
output, reasoning output, cache rate, and the max_output_tokens
ceiling policy, then size with the normalized-TPM formula, because the
model name alone is not enough information to size the workload.
This essay closes the operations arc; the loop returns to the methodology that decided which evidence was strong enough to publish in the first place.