
Operations essay · prompt cache retention

Retention is a request policy, not an assumption

Prompt caching can feel automatic, because supported models cache eligible prefixes by default. So the reflex is to lean on it: a workload reuses a shared prefix after an idle gap, across a shift boundary, after a batch pause, or on a later customer turn, and the request body simply omits prompt_cache_retention and trusts the window to still be there. But “cached by default” is not the same as “retained as long as the operation assumes.” If the request omits retention, the operation may be counting on a cache window it never asked for — so the operating question is not “is caching on?” but “did the request actually ask for the reuse window this workload depends on?”

This essay at a glance. Prompt caching is automatic, but the retention window is a request policy: Microsoft Learn documents an in_memory default that can clear within minutes and an explicit 24h window. The one-pager carries that single message (make retention explicit when later reuse matters) and marks the boundary: the retention modes are official spec, while the first-token-latency shape beside them is a sanitized public aggregate measured on pay-as-you-go, shown descriptively.
One-pager for prompt cache retention. It contrasts in-memory retention, the short default lifetime, with an explicit 24-hour window that keeps warm prefixes routable, and shows the 24h cell with a lower first-token-latency tail in this source run. Headline message: the retention window matters, so set it explicitly. Marked as the documented retention modes plus a sanitized measured aggregate, descriptive only.

Why set the window instead of assuming it

The default-is-fine reflex assumes the cache window will outlast the gap the workload leaves between reuses. Before accepting that, two things are worth putting side by side. The first is what Microsoft Learn documents: there are two retention policies, in_memory and 24h; in-memory entries are typically cleared within 5 to 10 minutes of inactivity and always removed within one hour of last use, while extended retention can keep cached prefixes routable for up to 24 hours; for gpt-5.4 and older models an omitted value means in_memory, and prompt-cache pricing is the same for both policies. The second is a sparse sanitized public aggregate from this repository's source runs, grouped by retention mode, included descriptively to show how first-token latency and cache hit ratio actually moved under in_memory versus 24h.

The pairing is the part worth pausing on. Hit ratio alone can read as reassuring while the retention choice surfaces somewhere else: in this slice both modes held a high hit ratio (about 0.93 to 0.96), so reading hit ratio by itself would suggest the retention mode barely mattered. The first-token latency tail tells the other half of the story — at cardinality 1 the in_memory series sat near 106,000 ms p95 while the 24h series sat near 9,900 ms, so the retention choice showed up in the latency tail, not in hit ratio. That is why retention should be read with latency beside cached-token share — and why it is a request policy and a governance choice, not just a performance knob. The two retention modes, their lifetimes, the in_memory default, the equal pricing, and the in-region condition for extended retention are what Microsoft Learn documents; the public latency-beside-hit-ratio slice is source-run evidence this repository defends, not a universal latency curve. The charts below make that pairing explicit.

Question

When the operation expects cache reuse later, does the request actually ask for that window?

Evidence

Microsoft Learn defines the retention modes; this repo adds sparse public latency and hit-rate evidence.

Decision

Make retention explicit, keep the prefix stable, and read latency beside cached-token share.

Automatic caching is not the same as explicit retention

Microsoft Learn describes two prompt-cache retention policies: in_memory and 24h. In-memory cache entries are typically cleared within 5 to 10 minutes of inactivity and are always removed within one hour of last use. Extended retention can keep cached prefixes active for longer, up to a maximum of 24 hours.
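The documented windows can be turned into a rough expectation for a given idle gap. The sketch below is illustrative only: the function name and the state labels are hypothetical, and the thresholds are simply the figures quoted in the paragraph above, not a guarantee of service behavior.

```python
from datetime import timedelta

# Windows quoted from the paragraph above; constant names are illustrative.
TYPICAL_CLEAR_START = timedelta(minutes=5)
TYPICAL_CLEAR_END = timedelta(minutes=10)
IN_MEMORY_HARD_LIMIT = timedelta(hours=1)
EXTENDED_MAX = timedelta(hours=24)


def expected_cache_state(idle_gap: timedelta, retention: str) -> str:
    """Return a rough expectation for a cached prefix after idle_gap since last use."""
    if idle_gap < timedelta(0):
        raise ValueError("idle_gap must not be negative")
    if retention == "24h":
        # Extended retention is an upper bound ("up to" 24 hours), not a promise.
        return "may be retained" if idle_gap < EXTENDED_MAX else "removed"
    if retention == "in_memory":
        if idle_gap < TYPICAL_CLEAR_START:
            return "likely retained"
        if idle_gap <= TYPICAL_CLEAR_END:
            return "at risk"  # inside the typical 5 to 10 minute clearing window
        if idle_gap < IN_MEMORY_HARD_LIMIT:
            return "probably cleared"
        return "removed"  # always removed within one hour of last use
    raise ValueError(f"unknown retention policy: {retention!r}")
```

Under these assumptions, a 30-minute batch pause maps to "probably cleared" for in_memory and "may be retained" for 24h, which is the gap the rest of this essay is about.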

Default can be short

For gpt-5.4 and older models, omitted retention means in_memory.

Longer reuse is explicit

If the operation expects later reuse, set prompt_cache_retention to 24h where supported.

Price is not the differentiator

Microsoft Learn states prompt-cache pricing is the same for both retention policies.
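Making the choice explicit can be as small as always writing the field into the request body. This is a minimal sketch that only builds a dictionary and sends nothing. The function name is hypothetical, and the "model" and "input" fields are an assumed request shape; only prompt_cache_retention and its two values come from the text above.

```python
from typing import Any

ALLOWED_RETENTION = ("in_memory", "24h")


def build_request_body(model: str, prompt: str, retention: str) -> dict[str, Any]:
    """Build a request body that always states its retention policy."""
    if retention not in ALLOWED_RETENTION:
        raise ValueError(f"retention must be one of {ALLOWED_RETENTION}, got {retention!r}")
    return {
        "model": model,    # assumed field: deployment or model name
        "input": prompt,   # assumed field: the prompt payload
        "prompt_cache_retention": retention,  # never left to the default
    }


# Example: a workload that reuses its prefix after long pauses.
body = build_request_body("gpt-5.2", "shared prefix ... user turn", "24h")
```

Because retention is a required argument here, the caller cannot pick up the short default by accident.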

Latency made the retention choice visible

Chart to read: public aggregate cache-key slice, now grouped by retention mode. The left panel shows p95 time to first token. The right panel shows cache hit ratio. Each group compares in_memory with 24h for the same cardinality.
Prompt cache retention evidence chart comparing TTFT p95 and cache hit ratio.
Source chart — first-token latency p95 by retention mode. X-axis: bucket cardinality on a log2 scale (1, then 8). Y-axis: steady-state p95 time to first token, in milliseconds. Lines: the two retention policies this essay compares — in_memory and 24h, N = 960 records each; a two-series line chart, not a frequency histogram. How to read it: at cardinality 1 the in_memory series sat at about 106,000 ms p95 while the 24h series sat at about 9,900 ms — in this cell the retention choice was visible in the latency tail — and the two series moved closer once traffic was spread across 8 buckets. Evidence boundary: one sparse sanitized public aggregate on pay-as-you-go gpt-5.2 (not PTU); source-run evidence, not a universal latency rule. Source: results/cache-key-bucketing/ttft_p95_vs_cardinality.csv (source CSV).
Line chart of steady-state p95 time to first token against bucket cardinality on a log2 axis, with one line per retention mode. The in_memory line starts very high near one hundred six thousand milliseconds at a single bucket and falls to about sixteen thousand at eight buckets, while the 24h line stays lower and roughly flat near ten to fifteen thousand milliseconds.
Paired source chart — cache hit ratio by retention mode. X-axis: the same log2 bucket-cardinality axis (1, then 8). Y-axis: steady-state cache hit ratio, 0–1. Lines: the same in_memory and 24h series, N = 960 each. How to read it: both series stayed high in this slice (about 0.93–0.96), with the 24h series slightly higher at cardinality 1; the larger retention difference here showed up in the latency panel above, not in hit ratio — which is why retention should be read with latency beside cached-token share. Evidence boundary: the same sparse sanitized public aggregate on pay-as-you-go gpt-5.2 (not PTU); descriptive, not a universal cache-hit rule. Source: results/cache-key-bucketing/cache_hit_ratio_vs_cardinality.csv (source CSV).
Line chart of steady-state cache hit ratio against bucket cardinality on a log2 axis, with one line per retention mode. Both the in_memory and 24h lines sit high between roughly 0.93 and 0.96, with the 24h line slightly higher at a single bucket, staying about flat as cardinality increases.
Public retention evidence

Retention   Cardinality   Hit ratio   TTFT p95        Steady records
in-memory   1             0.9334      106,389.96 ms   388
24h         1             0.9612      9,899.74 ms     390
in-memory   8             0.9586      16,623.66 ms    389
24h         8             0.9612      15,549.08 ms    389

What this does and does not prove

In this public slice, the cardinality-1 cell made retention visible in first-token latency. That is source-run evidence, not a universal rule. The durable lesson is simpler: read retention mode, hit ratio, and latency together.
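Reading the metrics together can be made concrete with the four public rows. The sketch below hard-codes those rows and compares the two retention modes per cardinality. The record type and function name are illustrative, not part of the repository.

```python
from typing import NamedTuple


class Row(NamedTuple):
    retention: str
    cardinality: int
    hit_ratio: float
    ttft_p95_ms: float
    steady_records: int


# The four rows of the public retention evidence table.
ROWS = [
    Row("in_memory", 1, 0.9334, 106_389.96, 388),
    Row("24h", 1, 0.9612, 9_899.74, 390),
    Row("in_memory", 8, 0.9586, 16_623.66, 389),
    Row("24h", 8, 0.9612, 15_549.08, 389),
]


def compare_by_cardinality(rows: list[Row]) -> dict[int, dict[str, float]]:
    """For each cardinality, compare in_memory against 24h on both metrics."""
    by_key = {(r.retention, r.cardinality): r for r in rows}
    result = {}
    for cardinality in sorted({r.cardinality for r in rows}):
        mem = by_key[("in_memory", cardinality)]
        ext = by_key[("24h", cardinality)]
        result[cardinality] = {
            "hit_ratio_gap": ext.hit_ratio - mem.hit_ratio,
            "ttft_p95_ratio": mem.ttft_p95_ms / ext.ttft_p95_ms,
        }
    return result


comparison = compare_by_cardinality(ROWS)
```

On these rows the hit-ratio gap stays under three points in both cells, while the in_memory p95 is roughly 10.7 times the 24h p95 at cardinality 1 and about 1.07 times at cardinality 8. That is a description of this slice, not a prediction for other workloads.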

If retention matters, make the choice observable

The public repo includes a small retention-policy helper that rejects an omitted value on models whose documented default is in_memory. That sounds strict, but it prevents the common failure mode: the design assumes a longer cache window while the request body silently takes the short default.

Good

prompt_cache_retention="24h" when the workload depends on later reuse.

Also good

prompt_cache_retention="in_memory" when the workload only needs short reuse and says so explicitly.

Risky

Leaving the value absent while the operating runbook assumes a longer window.
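A fail-closed check of the kind described above might look roughly like the following. This is a sketch, not the repository's actual helper: the function name, the exception type, and the caller-supplied set of in_memory-default models are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional

ALLOWED_RETENTION = ("in_memory", "24h")


class RetentionPolicyError(ValueError):
    """Raised when a request body does not state a usable retention policy."""


def require_explicit_retention(
    body: Mapping[str, Any],
    in_memory_default_models: frozenset,
) -> Optional[str]:
    """Return the stated retention, or fail closed when it is silently defaulted."""
    retention = body.get("prompt_cache_retention")
    if retention is None:
        if body.get("model") in in_memory_default_models:
            raise RetentionPolicyError(
                "prompt_cache_retention is omitted on a model that defaults to in_memory"
            )
        return None  # left to the service default for other models
    if retention not in ALLOWED_RETENTION:
        raise RetentionPolicyError(f"unsupported retention value: {retention!r}")
    return retention


# Hypothetical configuration: the caller decides which models default to in_memory.
IN_MEMORY_DEFAULT = frozenset({"gpt-5.2"})
```

Passing the model set in as an argument keeps the sketch from guessing which deployments fall under the documented default; that mapping belongs in configuration the operator owns.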

Longer retention also needs a governance check

Microsoft Learn states that in-memory prompt caching is compatible with all data-residency regions. For extended retention, it says cached data stays in-region only for Regional Standard or Regional Provisioned modes.

Longer retention is therefore a governance decision as much as a performance one: confirm that the serving mode keeps cached data in-region before relying on the extended window, instead of leaving the choice to a default.
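The residency condition can be encoded as a small pre-flight check. In this sketch the serving-mode identifiers are hypothetical strings chosen for illustration, and the function encodes only the two statements attributed to Microsoft Learn above.

```python
# Hypothetical identifiers for the two modes named in the text above.
IN_REGION_MODES_FOR_EXTENDED = frozenset({"regional_standard", "regional_provisioned"})


def retention_keeps_data_in_region(retention: str, serving_mode: str) -> bool:
    """Return True when cached data is expected to stay in-region for this pairing."""
    if retention == "in_memory":
        # Documented as compatible with all data-residency regions.
        return True
    if retention == "24h":
        return serving_mode in IN_REGION_MODES_FOR_EXTENDED
    raise ValueError(f"unknown retention policy: {retention!r}")
```

A deployment pipeline could call this before enabling 24h and route a False result to a governance review rather than silently downgrading the window.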

Sources and evidence boundary

Tier 1 — service contract (Microsoft Learn). The two retention policies, the in-memory clear-and-removal windows, the 24-hour extended-retention ceiling, the gpt-5.4-and-older in_memory default, the equal pricing across both policies, and the data-residency behavior are documented here.

Tier 2 — operational inference (this repository). The retention-policy helper that fails closed on an omitted value for in_memory-default models, and the sparse public hit-ratio and first-token-latency slice, are repository operational inference and source-run evidence, not Learn specifications.

What this topic does and does not prove. It documents the prompt-cache retention contract as Microsoft Learn states it on the access date — the in_memory and 24h policies, the in-memory clear-and-removal windows, the 24-hour extended ceiling, the gpt-5.4-and-older in_memory default, equal pricing across both policies, and the in-region residency condition for extended retention — plus a repository rule of thumb: make prompt_cache_retention explicit and fail closed when it is omitted on an in_memory-default model. The single public slice showed retention becoming visible in first-token latency at cardinality 1; that is source-run evidence, not proof of a universal latency curve across every model, region, or traffic shape.

The practical rule

When cache reuse must survive an idle gap of more than a few minutes on a model whose omitted value means in_memory, set prompt_cache_retention explicitly instead of trusting the default, keep the cacheable prefix stable, and read first-token latency beside cached-token share to confirm the window you asked for is the window you got.
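Cached-token share is just cached prompt tokens divided by total prompt tokens for a request. The sketch below assumes a usage payload shaped like the Chat Completions usage object, with prompt_tokens and prompt_tokens_details.cached_tokens; other API surfaces may name these fields differently. The helper names and the time-to-first-token value (measured by the caller) are illustrative.

```python
from typing import Any, Mapping


def cached_token_share(usage: Mapping[str, Any]) -> float:
    """Fraction of prompt tokens served from cache; 0.0 when there is no prompt."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens


def observation(retention: str, ttft_ms: float, usage: Mapping[str, Any]) -> dict[str, Any]:
    """Keep retention mode, latency, and cached share in one record so they are read together."""
    return {
        "retention": retention,
        "ttft_ms": ttft_ms,
        "cached_token_share": cached_token_share(usage),
    }


# Example with made-up numbers, only to show the shape of the record.
sample = observation(
    "24h",
    850.0,
    {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1500}},
)
```

Logging the three fields as one record avoids the trap described earlier, where a healthy hit ratio is read on its own while the retention choice shows up only in latency.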

The next essay turns from a single request's cache policy to what changes when a whole GPT-4o workload is re-sized for GPT-5.x on PTU.
