Operations essay · PTU recovery and 429 handling

429 is a recovery signal, not just an error

A traffic burst pushes a PTU deployment to 100% utilization and the calls start coming back as 429. The reflex is to reach for one of three moves: open a health-check polling loop until capacity looks free, retry on a blind fixed timer, or immediately route everything to another target. Each one assumes the client has to guess when it is safe to try again. It does not — the 429 already carries that answer in a header, retry-after-ms, so the operating question is not “did the call fail?” but “who owns the next decision: wait, redirect, or give up?”

Why read the header instead of guessing

All three reflexes share one hidden assumption: that the client cannot know when capacity will free up, so it has to probe, wait a fixed interval, or leave. Before accepting that, two things are worth putting side by side. The first is what Microsoft Learn documents: at 100% utilization the service returns 429 immediately and attaches retry-after-ms, a service-provided wait value for when to try again. The second is a sanitized public aggregate of the retry-after-ms values actually observed across this repository's source runs, included descriptively to show what those advised waits looked like in practice. Together they turn “when is it safe to try again?” from a client-side guess into a value the service hands back.

The shape of that aggregate is the part worth pausing on. The advised waits were short at the median — overall p50 ≈ 43 ms — so a health-check loop or a blind fixed-interval backoff would usually wait far longer than the service actually asked for. But the tail was not negligible: a small share of waits stretched toward 17,000 ms (overall p99 ≈ 16,921 ms). That short-median, long-tail shape is the operationally important fact — the header should be treated as an admission-control signal, not a fixed reset clock. The immediate 429 and the retry-after-ms contract are what Microsoft Learn documents; that a single latency-budget ceiling should decide when to stop waiting and redirect is operational inference this repository defends. The source chart below makes that distribution explicit.

Question

When PTU capacity says “not now,” how should the client decide the next move?

Evidence

Microsoft Learn provisioned-throughput and spillover documentation, plus this repository's PTU admission-controller operational note.

Decision

Use one retry owner. Honor the header, add jitter, and redirect only when latency policy demands it.

Scope note: This essay states the PTU 429 admission contract as Microsoft Learn documents it on the access date, plus a small set of operational rules of thumb this repository defends. It is a runbook for the retry wrapper, not a measurement of any specific deployment. The one observed retry-after-ms distribution it shows is a sanitized public aggregate from this repository's source runs, included descriptively to show that the advised waits were not constant — not a service-level guarantee.

Official spec · what the service states

The header is the admission signal

Microsoft Learn describes PTU high-utilization handling as an immediate 429 path with retry-after-ms and retry-after response headers. The same production guide explains that the wait can be used either for client-side retry or for redirecting the request to a different serving target.

What is documented

Treat retry-after-ms as a service-provided wait value. It is stronger than a guessed fixed reset interval.

What is not documented

The exact formula behind the wait value is not public. Do not encode a deterministic reset model in the client.

Operational inference

If the header already says when to try again, health-check polling is usually a weaker signal plus extra traffic and noisier telemetry.

A rate-limiting explainer animation. A burst of requests is shown twice. Without a limiter the burst hits the API directly and the API sheds load by returning 503. With a token-bucket limiter, within-limit requests reach a healthy API while over-limit requests get a 429 returned to the caller, so the API stays healthy. Conceptual illustration, not measured data. — **Why a 429 is a signal, not a wall:** the same burst of requests, two outcomes. Without a limiter the burst hits the API directly and it sheds load as 503. With a token-bucket limiter, within-limit requests reach a healthy API while over-limit requests get a 429 handed back to the caller — the admission signal this article is about. The header then says how long to wait. *Conceptual illustration, not measured data.*

Service behavior · why the wait is not a constant

The wait is admission control, not a fixed cooldown

Microsoft Learn documents PTU utilization as a variation of the leaky bucket algorithm. Each request adds an estimated compute cost — the prompt token count, minus cached tokens, plus the call's max_tokens — while deployed capacity drains continuously in proportion to the number of PTUs. When a request arrives and utilization is already at 100%, the service returns HTTP 429 immediately and attaches retry-after-ms and retry-after headers indicating how long to wait before the next request is accepted.

Because the wait reflects how the bucket is draining at that instant, it is not a fixed cooldown. The same documented contract therefore supports two client paths rather than one: honor the header when the wait fits the caller's latency budget, and treat any wait that exceeds that budget as a redirect-or-fail condition. That a single budget ceiling should split the two paths is operational inference this repository defends; the immediate 429 and the header contract are what Microsoft Learn documents.

Empirical cumulative distribution of observed retry-after-ms values. Each curve rises almost vertically near zero milliseconds, meaning most advised 429 waits in this sample were short, then flattens into a long tail extending to roughly seventeen thousand milliseconds for a small share of waits. Separate curves compare a burst-shaped source, a scheduled-shaped source, and the combined sample. — **Source chart — observed `retry-after-ms` distribution (descriptive).** **X-axis:** the wait the response header advised, in milliseconds. **Y-axis:** the empirical cumulative share of 429 responses at or below that wait. **Lines:** one curve per source group plus a combined curve — a burst-shaped source (n = 167), a scheduled-shaped source (n = 26), and the combined sample (n = 193); this is an empirical cumulative distribution, not a frequency histogram. **How to read it:** each curve rises almost to the top near zero — in this sample most advised waits were short (overall p50 ≈ 43 ms) — then flattens into a long tail reaching about 17,000 ms, so a small share of waits were very long (overall p99 ≈ 16,921 ms). That short-median, long-tail shape is why the operating pattern below caps the wait at a latency budget and then redirects or fails. **Evidence boundary:** this is one sanitized public aggregate of 429 events from this repository's source runs, shown descriptively; it is not a service-level guarantee, a reset model, or a measurement of any specific deployment, region, or future API version. **Source:** percentile values come from this repository's `results/retry-after-characterization/retry_after_ms_percentiles.csv` (source CSV).

Observed retry-after-ms percentiles (sanitized public aggregate)
Source group	429 count	p50	p90	p99	Max
Overall	193	43 ms	50.8 ms	16,921.12 ms	17,258 ms
Burst-shaped source	167	43 ms	49 ms	16,983.26 ms	17,258 ms
Scheduled-shaped source	26	3 ms	60 ms	60 ms	60 ms

Documented (Tier 1)

At 100% utilization the service returns 429 at once with retry-after-ms and retry-after; Learn calls the 429 a traffic-management signal, not a service error.

Documented (Tier 1)

The estimate uses max_tokens; setting it close to true generation size raises concurrency. Cached tokens receive a full discount and do not add to utilization.

Inferred (Tier 2)

That a caller should set one latency-budget ceiling and switch from waiting to redirecting above it is a repository rule of thumb, not a Learn specification.

Operating pattern · one owner for retry

Do not stack retry loops

The safe pattern is boring on purpose: pick exactly one component to own 429 recovery. If the SDK is configured to retry, let it own the wait. If an application router owns recovery, disable SDK auto-retry and make the router parse retry-after-ms directly.

PTU admission controller flow: observe 429, parse retry-after-ms, sleep or redirect, then resume. — **Reference implementation:** the public repo includes a small header-driven admission controller that parses `retry-after-ms`, falls back to `retry-after`, emits safe throttle events, and refuses double-retry ownership.

Sleep path

Use when throughput matters more than per-call latency. Wait the header value plus jitter, then try again.

Redirect path

Use when latency matters more than staying on PTU. Route to a standard target as soon as the 429 arrives.

Give-up path

Use when both wait and redirect violate policy. Return a controlled response rather than hiding retry storms.

Fallback · native or custom

Redirect is a policy choice, not a panic button

Microsoft Learn's spillover feature can redirect overage traffic to a standard target and exposes headers that identify spillover behavior. That is the low-operations path when the platform-supported shape fits. A custom router is still useful when the application must decide before the PTU target returns 429, or when routing policy depends on tenant, region, latency budget, or business priority.

Native spillover

Best when the simple overflow lane fits and extra routing logic would add more risk than value.

Custom router

Best when application policy must choose among wait, redirect, and give-up before user latency is spent.

Anti-pattern

SDK retry plus router retry plus spillover can multiply attempts. Consolidate ownership first.

Sources and evidence boundary

Tier 1 — service contract (Microsoft Learn). The 429 admission behavior, both retry headers, the leaky-bucket utilization model, and the SDK's default retry are documented here.

[1] Operate provisioned throughput deployments in production — defines the immediate 429 at 100% utilization, the retry-after-ms and retry-after response headers, the leaky-bucket utilization model, and the OpenAI SDK's default 429 retry (max_retries). accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-get-started; https://web.archive.org/web/20260604161404/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/provisioned-get-started.
[2] Manage traffic with spillover for provisioned deployments — defines the spillover redirect of non-200 (including 429) overage to a standard deployment via the spilloverDeploymentName property and the x-ms-spillover-from-deployment response header. accessed 2026-06-04: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/spillover-traffic-management; https://web.archive.org/web/20260604161530/https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/spillover-traffic-management.

Tier 2 — operational inference (this repository). The rules of thumb — one retry owner, honor the header, add jitter to de-synchronize a fleet, and redirect to spillover only after repeated honored waits fail to clear — are operational, not Learn specifications.

[3] this repository, docs/10-ptu-admission-controller.md — the header-driven admission controller note: single retry owner, honor retry-after-ms, default jitter, and fallback once the wait exceeds a ceiling. source.
[4] this repository, batch_runner/ptu/admission_controller.py — the reference implementation that parses retry-after-ms, falls back to retry-after, and refuses double-retry ownership. source.

What this topic does and does not prove. It documents the PTU admission contract as Microsoft Learn states it on the access date — the immediate 429 at 100% utilization, the retry-after-ms and retry-after headers, and the spillover redirect path — plus a small set of operational rules of thumb this repository defends: honor the header, add jitter to de-synchronize a fleet, and redirect to a spillover deployment only after repeated honored waits fail to clear. It does not validate that the contract holds identically across every region, model SKU, or future API version, and it does not measure any specific deployment under load.

The practical rule

The practical rule: when a PTU deployment returns 429, treat retry-after-ms as the admission contract — honor it, add jitter to de-synchronize the fleet, and redirect to a spillover deployment only after repeated honored waits fail to clear.

The next essay turns from when to retry to what a repeated prompt actually decides.

When the same prompt repeats, what does the cache key actually decide?