Operations essay · GPT-4o to GPT-5.x sizing

Do not resize a reasoning migration from the model name alone

A team plans to move its gpt-4o traffic to a GPT-5.x reasoning model and needs a PTU number for the new deployment. The reflex is to reuse what already works: carry the old gpt-4o PTU size forward, or apply a family-level “5.x” multiplier to it, on the assumption that the model name carries enough information to resize. But PTU sizing is driven by token shape, not by the model name: prompt size, visible output, hidden reasoning tokens, and cache hit rate all move the result, and the response ceiling may do so as well. That puts the reflex in direct conflict with the operating question this essay asks: is the model name enough, or does the request shape have to be re-measured first?

Why re-measure instead of reusing the old size

The reuse reflex assumes that a working gpt-4o number, scaled by a family multiplier, will still fit GPT-5.x. Before accepting that, two things are worth putting side by side. The first is what the documentation specifies. Microsoft Learn converts traffic into normalized TPM from peak RPM, average prompt and output tokens, cache rate, the model's output-to-input ratio, and its Input TPM per PTU, and it deducts cached prompt tokens from utilization. OpenAI documents reasoning tokens as non-visible output-side tokens reported in output_tokens_details.reasoning_tokens. The second is this repository's illustrative worked calculation, which runs that documented formula over one example call shape so the size drivers sit side by side. It is a calculation, not a measurement of any deployment.

The pairing is the part worth pausing on. Read on visible output alone, the migration can look manageable: the same prompt and the same visible answer can size at roughly the same level, or even slightly lighter, on the newer model. But the visible answer is not the whole request. Once hidden reasoning output appears, the same answer can need far more normalized TPM. This repository's working hypothesis, which is an admission risk to test and not a documented platform behavior, is that an oversized max_output_tokens ceiling could also reserve admission weight whether or not those tokens are ever generated. That is why the operating question is not “is the new model heavier?” but “what does the measured request shape (visible output, reasoning output, cache rate, and response ceiling) actually reserve?” The evidence boundary matters here: the normalized-TPM formula, the per-model table, the cache deduction, and the reasoning-token accounting are what Microsoft Learn and OpenAI document, while treating an oversized max_output_tokens as a PTU admission risk is operational inference this repository defends. The illustrative chart later in this essay is a synthetic worked example over a modeled workload with pinned pricing snapshots; it is not measured capacity, live PTU throughput, or an SLA.

Question

What must be re-measured before a GPT-4o workload is sized for GPT-5.x on PTU?

Evidence

Microsoft Learn gives the normalized-TPM formula; OpenAI documents reasoning-token accounting.

Decision

Size from measured token shape, not from family-level assumptions about “5.x”.

Official behavior · normalized TPM

The sizing formula is about token shape

Microsoft Learn sizes PTU demand by converting traffic into normalized TPM. The inputs are peak RPM, average prompt tokens, average generated output tokens, cache rate, the model's output-to-input ratio, and the model's Input TPM per PTU value.

Formula:

normalized_tpm = rpm × (prompt_tokens × (1 - cache_rate) + output_ratio × output_tokens)

raw_ptu = normalized_tpm ÷ input_tpm_per_ptu

That formula is why a family-level shortcut can mislead. A model can have a heavier output ratio and still have a higher Input TPM per PTU than the previous model. The answer depends on your workload's actual prompt, output, and cache profile.
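A minimal sketch of the formula above makes that trade-off concrete. The two profiles, `model_a` and `model_b`, are hypothetical placeholders chosen for illustration and are not values quoted from the Microsoft Learn table; the function and field names are likewise assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Per-model sizing constants; look these up in the current Learn table."""
    output_ratio: float       # input-token weight of one output token
    input_tpm_per_ptu: float  # normalized input TPM served by one PTU


def normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, profile):
    """Uncached prompt tokens plus weighted output tokens, per minute."""
    uncached_prompt = prompt_tokens * (1 - cache_rate)
    return rpm * (uncached_prompt + profile.output_ratio * output_tokens)


def raw_ptu(rpm, prompt_tokens, output_tokens, cache_rate, profile):
    """Unrounded PTU demand for one workload shape."""
    tpm = normalized_tpm(rpm, prompt_tokens, output_tokens, cache_rate, profile)
    return tpm / profile.input_tpm_per_ptu


# HYPOTHETICAL profiles: model_b has the heavier output ratio
# but also the higher Input TPM per PTU.
model_a = ModelProfile(output_ratio=4, input_tpm_per_ptu=2_500)
model_b = ModelProfile(output_ratio=8, input_tpm_per_ptu=3_400)

for output_tokens in (20, 80):
    a = raw_ptu(1_000, 200, output_tokens, 0.0, model_a)
    b = raw_ptu(1_000, 200, output_tokens, 0.0, model_b)
    lighter = "model_b" if b < a else "model_a"
    print(f"output={output_tokens:>3}  a={a:7.1f}  b={b:7.1f}  lighter={lighter}")
```

With these placeholder values the ordering flips between the two output sizes: the model with the heavier ratio is lighter at 20 output tokens and heavier at 80. Which side of that crossover a real workload sits on can only be read from its measured shape.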

Operating lens · what changes in migration

Four drivers decide whether the migration fits

1. Output weighting

For GPT-4.1 and later, Microsoft Learn says output tokens carry a model-specific input-token weight for utilization.

2. Hidden reasoning

OpenAI documents reasoning tokens as non-visible output-side tokens. Measure them from output_tokens_details.reasoning_tokens.

3. Cache rate

Microsoft Learn says cached prompt tokens are deducted from PTU utilization. A prefix-shape change can therefore move the sizing result.

4. Response ceiling

Record max_output_tokens on every call. This repository treats an oversized ceiling as a PTU admission-risk hypothesis to test, so the ceiling should be set from measurement rather than guessed.
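The four drivers can be pulled from a single call record, as in the sketch below. It assumes a usage payload shaped like the OpenAI Responses API, where output_tokens already includes reasoning tokens. Only output_tokens_details.reasoning_tokens is named in this essay; the other field names (input_tokens, input_tokens_details.cached_tokens), the function name, and the sample payload are assumptions to check against the API you actually call.

```python
def summarize_usage(usage: dict, max_output_tokens_sent: int) -> dict:
    """Split one usage payload into per-call sizing drivers.

    ASSUMPTION: Responses-API-style usage fields, with reasoning tokens
    counted inside output_tokens.
    """
    input_tokens = usage["input_tokens"]
    output_tokens = usage["output_tokens"]
    cached = usage.get("input_tokens_details", {}).get("cached_tokens", 0)
    reasoning = usage.get("output_tokens_details", {}).get("reasoning_tokens", 0)
    return {
        "prompt_tokens": input_tokens,
        "cache_rate": cached / input_tokens if input_tokens else 0.0,
        "visible_output_tokens": output_tokens - reasoning,
        "reasoning_tokens": reasoning,
        "generated_output_tokens": output_tokens,
        "max_output_tokens_sent": max_output_tokens_sent,
        # Unused room under the ceiling; large values flag a ceiling to review.
        "ceiling_headroom": max_output_tokens_sent - output_tokens,
    }


# HYPOTHETICAL payload for illustration only.
sample_usage = {
    "input_tokens": 200,
    "input_tokens_details": {"cached_tokens": 100},
    "output_tokens": 80,
    "output_tokens_details": {"reasoning_tokens": 60},
}
record = summarize_usage(sample_usage, max_output_tokens_sent=4096)
print(record)
```

Keeping visible and reasoning tokens as separate fields is the point of the sketch: the final answer text alone would report 20 tokens here, while the call generated 80.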

Illustrative calculation · same visible workload

Hidden output can change the answer

[Chart: Illustrative PTU sizing chart comparing GPT-4o and GPT-5.2 migration drivers.]

Chart to read: a 1,000 RPM example with 200 input tokens and 20 visible output tokens. The first three bars use the official normalized-TPM formula and model table. The final two bars add a 60-token reasoning assumption to show why invisible output must be measured.

Illustrative sizing rows
Row	Model	Cache rate	Generated output assumption	Rounded PTUs
Visible baseline	gpt-4o	0%	20 output tokens	115
Visible migration	gpt-5.2	0%	20 output tokens	110
Visible migration + cache	gpt-5.2	50%	20 output tokens	80
Reasoning migration	gpt-5.2	0%	20 visible + 60 reasoning tokens	250
Reasoning migration + cache	gpt-5.2	50%	20 visible + 60 reasoning tokens	220

What this means

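The visible-only rows do not support a blanket claim that GPT-5.x is always less efficient than GPT-4o: at 20 visible output tokens, the gpt-5.2 row sizes at 110 PTUs against 115 for gpt-4o. The reasoning rows show the trap. Adding 60 hidden reasoning tokens to the same visible answer raises the gpt-5.2 row from 110 to 250 PTUs, and a 50% cache rate brings that back only to 220. Once hidden reasoning output appears, the same visible answer can need much more normalized TPM unless cache and response policy are measured and tuned.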
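The rows above can be recomputed with a short script, which is a useful check whenever the model table changes. The per-model constants and the rounding increment below are assumptions: they were back-solved so that the documented formula reproduces the illustrative rows, and they are not quoted from Microsoft Learn. Replace them with the current published values before sizing anything.

```python
import math

# ASSUMPTION: constants back-solved from the illustrative rows,
# not copied from the Microsoft Learn model table.
ASSUMED_MODEL_TABLE = {
    "gpt-4o": {"output_ratio": 4, "input_tpm_per_ptu": 2_500},
    "gpt-5.2": {"output_ratio": 8, "input_tpm_per_ptu": 3_400},
}
ASSUMED_PTU_INCREMENT = 5  # ASSUMPTION: round up to this purchase increment


def rounded_ptus(model, rpm, prompt_tokens, output_tokens, cache_rate):
    """Apply the normalized-TPM formula, then round up to the increment."""
    consts = ASSUMED_MODEL_TABLE[model]
    normalized_tpm = rpm * (
        prompt_tokens * (1 - cache_rate)
        + consts["output_ratio"] * output_tokens
    )
    raw = normalized_tpm / consts["input_tpm_per_ptu"]
    return math.ceil(raw / ASSUMED_PTU_INCREMENT) * ASSUMED_PTU_INCREMENT


# (row label, model, cache rate, generated output tokens incl. reasoning)
rows = [
    ("Visible baseline", "gpt-4o", 0.0, 20),
    ("Visible migration", "gpt-5.2", 0.0, 20),
    ("Visible migration + cache", "gpt-5.2", 0.5, 20),
    ("Reasoning migration", "gpt-5.2", 0.0, 20 + 60),
    ("Reasoning migration + cache", "gpt-5.2", 0.5, 20 + 60),
]

results = [
    rounded_ptus(model, rpm=1_000, prompt_tokens=200,
                 output_tokens=out, cache_rate=cache)
    for _, model, cache, out in rows
]
for (label, *_), ptus in zip(rows, results):
    print(f"{label:<28} {ptus:>4} PTUs")
```

Under these assumed constants the script yields 115, 110, 80, 250, and 220 PTUs, matching the table. The reasoning tokens enter only through the generated-output term, which is why they are weighted by the output ratio and are untouched by the cache rate.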

Operator checklist · re-measure before buying capacity

Run the migration as a measurement, not a rename

Pin the task mix. Keep prompt, tools, schema, retrieval placement, and sample distribution byte-for-byte identical across the model cells being compared.
Measure visible output. Compare mean and p95 visible output tokens before and after migration.
Measure reasoning output. Capture reasoning_tokens; do not infer it from the final answer text.
Measure cache rate. Record cached_tokens / prompt_tokens per cache-key bucket, especially after ReAct or tool-schema changes.
Right-size the ceiling. Set max_output_tokens from a representative visible-plus-reasoning percentile, then use per-call overrides for rare long answers.
Only then size PTU. Apply the normalized-TPM formula with the current model table and your measured workload shape.
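The measurement steps in this checklist reduce to a small aggregation over per-call records, sketched below. The record keys (`bucket`, `prompt`, `cached`, `visible`, `reasoning`), the nearest-rank percentile method, the p95 default for the ceiling, and the sample data are all assumptions of this sketch, not a schema or policy defined by Microsoft Learn or OpenAI.

```python
import math
from collections import defaultdict
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_cell(calls, ceiling_pct=95):
    """Reduce per-call records for one model cell to sizing inputs."""
    generated = [c["visible"] + c["reasoning"] for c in calls]
    cached_by_bucket = defaultdict(int)
    prompt_by_bucket = defaultdict(int)
    for c in calls:
        cached_by_bucket[c["bucket"]] += c["cached"]
        prompt_by_bucket[c["bucket"]] += c["prompt"]
    total_prompt = sum(prompt_by_bucket.values())
    return {
        "mean_prompt_tokens": mean(c["prompt"] for c in calls),
        "mean_visible_tokens": mean(c["visible"] for c in calls),
        "p95_visible_tokens": nearest_rank_percentile(
            [c["visible"] for c in calls], 95),
        "mean_reasoning_tokens": mean(c["reasoning"] for c in calls),
        "mean_generated_tokens": mean(generated),
        "cache_rate_by_bucket": {
            b: cached_by_bucket[b] / p for b, p in prompt_by_bucket.items() if p
        },
        "cache_rate": (sum(cached_by_bucket.values()) / total_prompt
                       if total_prompt else 0.0),
        # Ceiling candidate from visible-plus-reasoning tokens, not visible alone.
        "suggested_ceiling": nearest_rank_percentile(generated, ceiling_pct),
    }


# HYPOTHETICAL per-call records for illustration only.
calls = [
    {"bucket": "faq", "prompt": 200, "cached": 100, "visible": 20, "reasoning": 60},
    {"bucket": "faq", "prompt": 200, "cached": 100, "visible": 18, "reasoning": 40},
    {"bucket": "tools", "prompt": 400, "cached": 0, "visible": 30, "reasoning": 120},
    {"bucket": "tools", "prompt": 400, "cached": 0, "visible": 22, "reasoning": 90},
]
summary = summarize_cell(calls)
print(summary)
```

The mean prompt tokens, mean generated tokens, and overall cache rate from such a summary are the inputs the normalized-TPM formula expects. The per-bucket cache rates are kept separately because an average can hide a bucket whose prefix stopped caching after a tool-schema change.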

Sources and evidence boundary

Tier 1 — service contract (Microsoft Learn and OpenAI). The normalized-TPM sizing formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, the 100% cache deduction from utilization, and the reasoning-token accounting are documented here.

[1] Determine PTU sizing for a workload. Defines the normalized-TPM calculation (Input TPM = Peak RPM × prompt size; Output TPM = Peak RPM × response size; Normalized TPM = input TPM × (1 − cache rate) + output-to-input ratio × output TPM; PTUs = normalized TPM ÷ Input TPM per PTU) and the per-model Input TPM per PTU and output-to-input ratio values, and states that cached tokens are deducted 100% from the utilization calculation. Accessed 2026-06-05: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-throughput-sizing; https://web.archive.org/web/20260605011106/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-throughput-sizing.
[2] What is provisioned throughput for Foundry Models? Documents the PTU utilization model and how request shape consumes reserved capacity. Accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput; https://web.archive.org/web/20260426163819/https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput.
[3] Prompt caching with Azure OpenAI. Documents that cached prompt tokens are deducted from utilization, so a prefix-shape change moves the sizing result. Accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching; https://web.archive.org/web/20260604174202/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching.
[4] OpenAI reasoning models guide. Documents reasoning tokens as non-visible output-side tokens reported in output_tokens_details.reasoning_tokens. Accessed 2026-06-04: https://developers.openai.com/api/docs/guides/reasoning; https://web.archive.org/web/20260602053729/https://developers.openai.com/api/docs/guides/reasoning.

Tier 2 — operational inference (this repository). The treatment of an oversized max_output_tokens as a PTU admission-risk hypothesis, and the measurement contract for reasoning and cached tokens, are repository operational inference, not Learn specifications.

[5] This repository, docs/13-ptu-vs-payg-decision-runbook.md. Records mean_max_output_tokens and interprets a large max_output_tokens as reserved admission weight to test, not a value to guess.
[6] This repository, docs/14-observability-schema.md. The canonical PTU record contract mapping reasoning_tokens, cached_tokens, and max_output_tokens_sent to their source usage fields.

What this topic does and does not prove. It documents the PTU sizing contract as Microsoft Learn states it on the access date — the normalized-TPM formula, the per-model Input TPM per PTU and output-to-input ratio, the GPT-4.1-and-later output-token weighting, and the 100% cache deduction — plus OpenAI's reasoning-token accounting and a repository rule of thumb: re-measure output, reasoning, cache, and ceiling before sizing. The illustrative chart applies the documented formula to one example call shape; it is a worked calculation, not a capacity guarantee, and it does not prove that any specific GPT-5.x workload is heavier or lighter than its GPT-4o predecessor across every region, model version, or traffic shape.

The practical rule

Before moving GPT-4o traffic to GPT-5.x on PTU, re-measure visible output, reasoning output, cache rate, and the max_output_tokens ceiling policy, then size with the normalized-TPM formula. The model name alone is not enough information to size the workload.

This essay closes the operations arc; the loop returns to the methodology that decided which evidence was strong enough to publish in the first place.
