Documentation

CachePilot is a governed proxy for the OpenAI Responses API and the Anthropic Messages API. Swap your base URL, add one header, and every request gets policy enforcement, cache optimization, and auditable receipts.

Two providers. OpenAI models route through /v1/responses; Anthropic models (including Claude Fable 5) route through /v1/messages. Both use BYOK — your upstream key passes through untouched.

For coding agents

If you are a coding agent helping a user govern LLM spend, start here. Treat the human as the budget and policy owner. Evaluate one representative workflow in observe mode before recommending enforcement.

Evaluation checklist

1. Ask the user to create/select a CachePilot project in /dashboard.
2. Obtain the one-time cp_live_ project key; never print or commit secrets.
3. Set the OpenAI-compatible base URL: https://api.cachepilot.clclabs.ai/v1
4. Keep Authorization: Bearer <provider-key> and add: X-CachePilot-Key: cp_live_...
5. Send one representative request to /v1/responses.
6. Verify X-CP-Receipt-Id, X-CP-Policy-Version, and X-CP-Output-Budget-Applied in the response.
7. Compare spend, cache usage, and latency with the direct baseline.
8. Ask the human policy owner before changing observe mode to govern mode.
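Step 6 of the checklist can be automated with a small helper. The sketch below is illustrative only: the function name is hypothetical, and it assumes the caller already has the response headers as a plain mapping from whatever HTTP client is in use.

```python
from typing import Mapping

# Receipt headers named in step 6 of the evaluation checklist.
REQUIRED_RECEIPT_HEADERS = (
    "X-CP-Receipt-Id",
    "X-CP-Policy-Version",
    "X-CP-Output-Budget-Applied",
)


def missing_receipt_headers(headers: Mapping[str, str]) -> list[str]:
    """Return the required receipt headers that are absent or empty."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    present = {name.lower() for name, value in headers.items() if value.strip()}
    return [h for h in REQUIRED_RECEIPT_HEADERS if h.lower() not in present]
```

An empty list means all three receipts arrived; anything else is worth raising with the human policy owner before going further.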

Claude/Fable adapters: An OpenAI-compatible adapter can receive full governance when it targets the governed Responses surface. The Chat Completions compatibility route is telemetry-only today. Native Claude traffic can use /v1/messages. See the agent spend governance use case and the provider compatibility matrix.

⚡ Quickstart

1. Get your project key — Create a project in the Dashboard. You'll receive cp_live_... — save it, it's shown once.

2. Swap your base URL — Point your OpenAI client to the CachePilot endpoint:

    # Before (direct)
    curl https://api.openai.com/v1/responses \
      -H "Authorization: Bearer sk-your-key" \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-4.1","input":"Hello"}'

    # After (through CachePilot)
    curl https://api.cachepilot.clclabs.ai/v1/responses \
      -H "Authorization: Bearer sk-your-key" \
      -H "Content-Type: application/json" \
      -H "X-CachePilot-Key: cp_live_proj_abc" \
      -d '{"model":"gpt-4.1","input":"Hello"}'

3. Read the receipt headers — Every response includes X-CP-* headers proving what policy was applied.

That's it. The proxy is fully transparent to the OpenAI SDK. Your Authorization header (your OpenAI key) passes through untouched — BYOK.

Python / TypeScript SDK:

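    # Python
    from openai import OpenAI

    # The SDK reads your provider key from OPENAI_API_KEY and sends it
    # as the Authorization header.
    client = OpenAI(
        base_url="https://api.cachepilot.clclabs.ai/v1",
        default_headers={
            "X-CachePilot-Key": "cp_live_proj_abc",
        },
    )

    response = client.responses.create(
        model="gpt-4.1",
        input="Refactor the auth module...",
        # The shell tool is rejected with HTTP 403 unless the project
        # policy sets openai.allow_shell to true.
        tools=[
            {"type": "shell"},
            {"type": "code_interpreter", "container": {"type": "auto"}},
        ],
        stream=True,
    )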

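    // TypeScript
    import OpenAI from "openai";

    // The SDK reads your provider key from OPENAI_API_KEY and sends it
    // as the Authorization header.
    const client = new OpenAI({
      baseURL: "https://api.cachepilot.clclabs.ai/v1",
      defaultHeaders: {
        "X-CachePilot-Key": "cp_live_proj_abc",
      },
    });

    const response = await client.responses.create({
      model: "gpt-4.1",
      input: "Refactor the auth module...",
      // The shell tool is rejected with HTTP 403 unless the project
      // policy sets openai.allow_shell to true.
      tools: [
        { type: "shell" },
        { type: "code_interpreter", container: { type: "auto" } },
      ],
      stream: true,
    });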

Anthropic (Claude Fable 5): point the Anthropic SDK at the same base URL with a compound key — cp_live_...:sk-ant-... — and requests route through /v1/messages with the same governance and receipts. On Fable 5, the proxy also strips request params the model rejects (temperature, top_p, top_k) and can inject a cache_control breakpoint when you haven't set one — every change is listed in X-CP-Mutations.

    // TypeScript (Anthropic SDK)
    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic({
      baseURL: "https://api.cachepilot.clclabs.ai/v1",
      apiKey: "cp_live_proj_abc:sk-ant-your-key", // compound BYOK
    });

    const response = await client.messages.create({
      model: "claude-fable-5",
      max_tokens: 16000,
      system: "You are a helpful assistant.",
      messages: [{ role: "user", content: "Refactor the auth module..." }],
    });
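The parameter stripping described for Fable 5 can be pictured roughly as follows. This is a sketch under stated assumptions, not the proxy's code: the function name is hypothetical, and the real format of X-CP-Mutations is not documented here, so the sketch simply returns the names of the removed fields.

```python
from typing import Any

# Sampling parameters that, per the text above, Fable 5 rejects.
REJECTED_PARAMS = ("temperature", "top_p", "top_k")


def strip_rejected_params(body: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a cleaned copy of a Messages request body and the names removed."""
    # Assumption: the stripping applies only to claude-fable-5 requests.
    if body.get("model") != "claude-fable-5":
        return dict(body), []
    removed = [name for name in REJECTED_PARAMS if name in body]
    cleaned = {k: v for k, v in body.items() if k not in REJECTED_PARAMS}
    return cleaned, removed
```

The input body is never modified in place, so the caller can still compare the original request with what was sent upstream.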

Cache receipts. Anthropic reports cache reads and writes explicitly, so every non-streaming response includes X-CP-Cache-Read-Tokens and X-CP-Cache-Write-Tokens. Cache reads cost 0.1× input, so these two headers let you compute your caching savings per request.
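As a rough illustration, the two cache headers can be turned into a net savings figure, expressed in input-token equivalents. The 0.1× read multiplier comes from the text above; the 1.25× write multiplier is an assumption (Anthropic's published rate for 5-minute cache writes) and should be adjusted if you use a different cache TTL. Both function names are hypothetical.

```python
from typing import Mapping


def cache_tokens_from_headers(headers: Mapping[str, str]) -> tuple[int, int]:
    """Read (cache_read_tokens, cache_write_tokens) from response headers."""
    h = {k.lower(): v for k, v in headers.items()}
    read = int(h.get("x-cp-cache-read-tokens", "0"))
    write = int(h.get("x-cp-cache-write-tokens", "0"))
    return read, write


def net_cache_savings(
    read_tokens: int,
    write_tokens: int,
    read_multiplier: float = 0.1,    # from the docs: cache reads cost 0.1x input
    write_multiplier: float = 1.25,  # assumption: 5-minute cache write rate
) -> float:
    """Net savings versus an uncached request, in input-token equivalents."""
    saved_on_reads = read_tokens * (1.0 - read_multiplier)
    extra_on_writes = write_tokens * (write_multiplier - 1.0)
    return saved_on_reads - extra_on_writes
```

A negative result means the request paid more for cache writes than it saved on reads, which is normal for the first request that seeds a cache entry.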

🖥️ Codex CLI

OpenAI Codex CLI works natively with CachePilot. Set up a custom model provider so every Codex session goes through the proxy — policy enforcement, cache optimization, and telemetry included.

1. Set environment variables

CachePilot needs your project key and an OpenAI API key with full api.responses.write scope.

    # PowerShell (permanent, survives restarts)
    [System.Environment]::SetEnvironmentVariable("CACHE_PILOT_KEY", "cp_live_YOUR_KEY", "User")
    [System.Environment]::SetEnvironmentVariable("OPENAI_API_KEY", "sk-proj-YOUR_KEY", "User")

    # macOS / Linux
    export CACHE_PILOT_KEY="cp_live_YOUR_KEY"
    export OPENAI_API_KEY="sk-proj-YOUR_KEY"
    # Add to ~/.bashrc or ~/.zshrc to persist

Open a new terminal after setting environment variables. $env:VAR = "value" in PowerShell is session-scoped — it won't persist across terminals.

2. Configure Codex

Add this to your ~/.codex/config.toml:

    # ~/.codex/config.toml
    [profiles.cachepilot]
    name = "CachePilot"
    model_provider = "cachepilot"
    model = "gpt-4o"  # or gpt-4.1, o3, etc.

    [model_providers.cachepilot]
    name = "CachePilot Proxy"
    base_url = "https://api.cachepilot.clclabs.ai/v1"
    wire_api = "responses"
    # Use your sk-proj key (NOT OAuth)
    env_key = "OPENAI_API_KEY"
    # Send your cp_live_ project key
    env_http_headers = { "X-CachePilot-Key" = "CACHE_PILOT_KEY" }

Important: Use env_key, not requires_openai_auth. The latter uses Codex's OAuth token, which may lack the api.responses.write scope needed for the Responses API.

3. Run Codex

codex --profile cachepilot

That's it. Every Codex request now flows through CachePilot with policy enforcement, cache optimization, and full telemetry. Check your Dashboard to see requests in real time.

Supported models

The proxy supports all OpenAI Responses API models via /v1/responses:

gpt-4o gpt-4o-mini gpt-4.1 gpt-4.1-mini gpt-4.1-nano gpt-5 gpt-5-codex gpt-5.1 gpt-5.1-codex gpt-5.1-codex-mini o3 o3-mini o4-mini

…and all Anthropic models via /v1/messages:

claude-fable-5 claude-opus-4-8 claude-opus-4-7 claude-opus-4-6 claude-sonnet-4-6 claude-haiku-4-5

Plan gate. claude-fable-5 is a premium model ($10/$50 per MTok — 2× Opus) and requires a Pro or Team plan. Other plans receive a 403 MODEL_REQUIRES_PLAN before any upstream call — no tokens are spent. All other Claude models are available on every plan.

🛡️ Policies

Each project has a PolicyV1 JSON object that controls what the proxy allows. Policies are enforced deterministically before any request reaches the upstream provider, and every change creates a new immutable version — so you can always reconstruct exactly what was applied to a given request from its policy_version and policy_hash.

    {
      "version": 1,
      "openai": {
        "allow_shell": false,
        "allowed_skill_ids": [],
        "default_skill_ids": [],
        "required_skill_ids": [],
        "deny_skill_ids": []
      },
      "output_budget": {
        "mode": "CLAMP",
        "default_max_output_tokens": 4096,
        "hard_max_output_tokens": 16384,
        "min_max_output_tokens": 100,
        "allow_request_override": false
      },
      "prefix_cache": {
        "mode": "AUTO",
        "instructions_splitting": true,
        "spine_field": null,
        "prompt_cache_key_source": "PREFIX_HASH"
      },
      "output_cap_instruction": {
        "enabled": false,
        "instruction": ""
      },
      "telemetry": {
        "store_prompts": false,
        "store_outputs": false
      }
    }

Shell Access

openai.allow_shell — When false, any request containing {"type":"shell"} or {"type":"computer_use_preview"} is rejected with HTTP 403 before it reaches OpenAI. The telemetry row records shell_requested=true, shell_denied=true.
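The shell gate is simple enough to sketch. The helper below is a minimal illustration with a hypothetical name and return shape; it only mirrors the behavior described above (tool types `shell` and `computer_use_preview`, HTTP 403 on denial, and the two telemetry flags).

```python
from typing import Any, Optional

# Tool types the shell policy treats as shell access, per the text above.
SHELL_TOOL_TYPES = {"shell", "computer_use_preview"}


def check_shell(tools: list[dict[str, Any]], allow_shell: bool) -> dict[str, Any]:
    """Decide whether a request's tool list passes the shell policy."""
    shell_requested = any(t.get("type") in SHELL_TOOL_TYPES for t in tools)
    shell_denied = shell_requested and not allow_shell
    # status is 403 when the request must be rejected, otherwise None.
    status: Optional[int] = 403 if shell_denied else None
    return {
        "shell_requested": shell_requested,
        "shell_denied": shell_denied,
        "status": status,
    }
```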

Skills

Four arrays of skill IDs combine into a deterministic contract:

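applied = (((requested ∪ default) ∩ allowed) − deny) ∪ required

The intersection with allowed is skipped when allowed_skill_ids is empty, since an empty whitelist means no restriction.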

Field	Behavior
`allowed_skill_ids`	Whitelist. Empty = no restriction; non-empty = only these can run.
`deny_skill_ids`	Blocklist. Always removed from the final set.
`default_skill_ids`	Applied when the client doesn't request any skills.
`required_skill_ids`	Always injected, even if the client didn't ask for them.

See the Skills section for the catalog of built-in skill IDs.
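The skills contract maps directly onto set operations. The sketch below reads the formula left to right and treats an empty whitelist as "no restriction", as the table states. It is an interpretation, not the proxy's code: in particular, it unions the defaults with the requested skills as the formula is written, whereas the table describes defaults as applying when the client requests none.

```python
from typing import Iterable


def applied_skills(
    requested: Iterable[str],
    default: Iterable[str],
    allowed: Iterable[str],
    deny: Iterable[str],
    required: Iterable[str],
) -> set[str]:
    """Evaluate the skills set algebra for one request."""
    candidate = set(requested) | set(default)
    allowed_set = set(allowed)
    if allowed_set:  # an empty whitelist means no restriction
        candidate &= allowed_set
    candidate -= set(deny)
    # Required skills are added last, so they are always present.
    return candidate | set(required)
```

For example, requesting `prefix_guard` and `seed_lock` with `seed_lock` denied and `receipt_emitter` required yields `prefix_guard` plus `receipt_emitter`.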

Output Budget

Controls the max_output_tokens parameter forwarded to OpenAI.

Field	Behavior
`mode`	`PASS_THROUGH` — client value untouched · `DEFAULT` — use policy default if client omits it · `CLAMP` — client value wins but is clamped to [`min`, `hard`] · `FIXED` — always override to policy default.
`default_max_output_tokens`	Used when the client omits a value, or as the fixed override.
`hard_max_output_tokens`	Absolute ceiling. Client values above this are clamped.
`min_max_output_tokens`	Absolute floor. Client values below this are raised.
`allow_request_override`	When `false`, request-level values are ignored entirely.

The final value sent upstream is recorded as applied_max_output_tokens on the telemetry row and echoed in X-CP-Output-Budget-Applied.
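One way to read the four modes is as a small resolution function. The sketch below is an interpretation under explicit assumptions, not the proxy's implementation: `PASS_THROUGH` forwards the client value regardless of other settings, `DEFAULT` does not clamp a client-supplied value, and `allow_request_override = false` makes the client value behave as if it were omitted. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputBudget:
    # Defaults mirror the example policy shown earlier.
    mode: str = "CLAMP"
    default_max_output_tokens: int = 4096
    hard_max_output_tokens: int = 16384
    min_max_output_tokens: int = 100
    allow_request_override: bool = False


def resolve_max_output_tokens(policy: OutputBudget, requested: Optional[int]) -> Optional[int]:
    """Return the max_output_tokens value to forward upstream."""
    if policy.mode == "PASS_THROUGH":
        return requested  # client value untouched (may be None)
    if policy.mode == "FIXED":
        return policy.default_max_output_tokens
    # Assumption: with overrides disabled, the client value counts as omitted.
    if not policy.allow_request_override:
        requested = None
    value = requested if requested is not None else policy.default_max_output_tokens
    if policy.mode == "DEFAULT":
        return value
    if policy.mode == "CLAMP":
        return max(policy.min_max_output_tokens, min(value, policy.hard_max_output_tokens))
    raise ValueError(f"unknown output budget mode: {policy.mode}")
```

Under these assumptions the example policy resolves every request to 4096, which is consistent with the `X-CP-Output-Budget-Applied: 4096` value used in the receipt examples.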

Prefix Cache

prefix_cache.mode (DISABLED / AUTO / MANUAL) governs how the proxy derives a stable prompt_cache_key for each request. prompt_cache_key_source selects what that key is seeded from (NONE, PREFIX_HASH, or PROJECT_ID). The goal is to maximize upstream cache hits without leaking content into the key; the exact derivation is an implementation detail of the proxy.
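Since the real derivation is internal to the proxy, the following is purely illustrative of how a content-free key could be seeded from each source. Every detail here is an assumption: the function names, the use of SHA-256 truncated to 16 hex characters (borrowed from the `prefix_hash` telemetry field), and returning the project ID unchanged.

```python
import hashlib
from typing import Optional


def prefix_hash(instructions: str) -> str:
    """Hash the instructional prefix so no raw content appears in the key."""
    return hashlib.sha256(instructions.encode("utf-8")).hexdigest()[:16]


def prompt_cache_key(source: str, instructions: str, project_id: str) -> Optional[str]:
    """Pick a stable cache key according to prompt_cache_key_source."""
    if source == "NONE":
        return None
    if source == "PREFIX_HASH":
        return prefix_hash(instructions)
    if source == "PROJECT_ID":
        return project_id
    raise ValueError(f"unknown prompt_cache_key_source: {source}")
```

The point of the sketch is the property, not the recipe: identical prefixes map to the same key, and the key reveals nothing about the prompt text.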

Output Cap Instruction

When output_cap_instruction.enabled is true, the proxy appends a short system-level instruction reinforcing the output budget. Useful for models that treat max_output_tokens as advisory.

Telemetry Controls

telemetry.store_prompts and telemetry.store_outputs both default to false. In the default configuration the proxy is fully content-free — only hashes, counts, and operational metadata are persisted.

Immutable versions. Every policy edit creates a new row in project_policies. The policy_hash on each telemetry row is an audit fingerprint — if the policy changed between two requests, you'll see different hashes, and you can always look up the exact JSON that was applied.
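To see why the hash works as an audit fingerprint, consider hashing a canonical serialization of the policy. The canonicalization below (sorted keys, compact separators) is an assumption chosen for the sketch; the proxy's actual serialization may differ, so hashes computed this way are not guaranteed to match the stored `policy_hash`.

```python
import hashlib
import json
from typing import Any


def policy_hash(policy: dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form of the policy."""
    # Sorting keys makes the hash independent of key order in the source JSON.
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two policies that differ in any field produce different hashes, while reordering keys does not change the result.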

🧩 Skills

Skills are the named governance primitives the proxy enforces on every request. Each skill is a small, deterministic contract — the list below is the catalog of built-ins you can reference by ID in a policy's skill arrays.

ID	Contract
`prefix_guard`	Asserts the instructional prefix matches the expected fingerprint.
`seed_lock`	Pins determinism controls (seed / sampling) for reproducible runs.
`output_budget`	Enforces the output-budget policy on every request.
`request_cost_guard`	Rejects requests whose projected cost exceeds the policy ceiling.
`tool_whitelist`	Applies the skills set-algebra contract to the request's tool list.
`tool_schema_enforcer`	Validates tool definitions against the expected schema shape.
`tool_output_sanitizer`	Strips disallowed fields from tool outputs before they reach the model.
`receipt_emitter`	Produces the `X-CP-*` receipt headers on every response.
`hash_redactor`	Guarantees only hashes — never raw content — are persisted to telemetry.
`drift_detector`	Compares the live request against pinned golden runs and emits drift events.

Skills are grouped into tier bundles — core_free, core_starter, and core_pro — which determine which skills are available on your project's plan. See Pricing for bundle membership.

🎯 Drift & Golden Runs

A golden run is a pinned baseline — any request from the dashboard can be promoted, which captures its policy version, prefix hash, and tool/skill fingerprints as the expected state for that route. Subsequent requests on the same route are compared against the baseline, and any mismatch is recorded as a drift_event.

Drift detection is content-free: it only looks at hashes, so it catches structural changes without ever needing to see your prompts or outputs.

Drift type	Meaning
`PREFIX_CHANGED`	The instructional prefix hash no longer matches the golden run, meaning the prompt has shifted.
`POLICY_CHANGED`	A different policy version was applied than the one pinned on the baseline.
`SKILL_CHANGED`	The applied skills hash differs from the golden run — the governance envelope changed.

Drift events surface in the dashboard's Determinism and Golden Runs tabs, so you can spot silent prompt or policy regressions before they hit production traffic.
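The comparison against a golden run reduces to checking three fingerprints. The sketch below uses hypothetical names (`Fingerprint`, `detect_drift`) and assumes one drift type per mismatched field, using the drift type names from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Content-free fingerprint of a request: hashes and a version only."""
    prefix_hash: str
    policy_version: int
    skills_hash: str


def detect_drift(golden: Fingerprint, live: Fingerprint) -> list[str]:
    """Return the drift types where the live request differs from the baseline."""
    events = []
    if live.prefix_hash != golden.prefix_hash:
        events.append("PREFIX_CHANGED")
    if live.policy_version != golden.policy_version:
        events.append("POLICY_CHANGED")
    if live.skills_hash != golden.skills_hash:
        events.append("SKILL_CHANGED")
    return events
```

An empty list means the live request matches the pinned baseline on all three dimensions.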

📋 Receipt Headers

Every response from the proxy includes X-CP-* headers. These are your policy receipt — proof of what was enforced, without storing any content.

Header	Example	Description
`X-CP-Receipt-Id`	`b7a3...1e09`	UUID of this telemetry row
`X-CP-Policy-Version`	`1`	Version of the project policy that was applied
`X-CP-Output-Budget-Applied`	`4096`	Actual max_output_tokens sent to OpenAI
`X-CP-Skills-Applied-Hash`	`a3f8...d6f8`	SHA-256 of the final tool set after policy
`X-CP-Prefix-Hash`	`7b2d...e5d7`	SHA-256 of instructional prefix (proves prompt stability)

    < HTTP/2 200
    < X-CP-Receipt-Id: b7a3c8e2-1f4d-4a9b-8c6e-3d5f7a2b1e09
    < X-CP-Policy-Version: 1
    < X-CP-Output-Budget-Applied: 4096
    < X-CP-Skills-Applied-Hash: a3f829c1e2b4d6f8
    < X-CP-Prefix-Hash: 7b2df491a8c3e5d7
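For logging or auditing, the receipt headers can be collected into a typed record. This is a minimal sketch with hypothetical names (`Receipt`, `parse_receipt`); it assumes the receipt ID is a well-formed UUID and that the version and budget headers carry plain integers, as in the example above.

```python
import uuid
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Receipt:
    receipt_id: uuid.UUID
    policy_version: int
    output_budget_applied: Optional[int]
    skills_applied_hash: Optional[str]
    prefix_hash: Optional[str]


def parse_receipt(headers: Mapping[str, str]) -> Receipt:
    """Build a Receipt from response headers; raises if a required one is missing."""
    # Header names are case-insensitive, so normalize to lowercase first.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    budget = h.get("x-cp-output-budget-applied")
    return Receipt(
        receipt_id=uuid.UUID(h["x-cp-receipt-id"]),
        policy_version=int(h["x-cp-policy-version"]),
        output_budget_applied=int(budget) if budget else None,
        skills_applied_hash=h.get("x-cp-skills-applied-hash"),
        prefix_hash=h.get("x-cp-prefix-hash"),
    )
```

A missing `X-CP-Receipt-Id` or `X-CP-Policy-Version` raises `KeyError`, which makes an ungoverned response fail loudly instead of being logged as if it had a receipt.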

📊 Telemetry Fields

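Every proxied request writes a telemetry row to Postgres. With the default policy (telemetry.store_prompts and telemetry.store_outputs both false) the row is content-free: no prompts, no outputs, only operational metadata.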

Field	Type	Description
`request_ts`	timestamp	When the request arrived
`model`	text	Model name (e.g. gpt-4.1)
`stream`	boolean	Was this a streaming request?
`prefix_hash`	text	SHA-256 of instructional prefix (truncated to 16 hex characters)
`skills_hash`	text	Hash of the applied skills set
`toolset_hash`	text	Hash of the canonicalized tool list
`schema_hash`	text	Hash of the response-format schema, if any
`policy_version`	int	Policy version applied
`policy_hash`	text	SHA-256 of the full policy JSON
`applied_max_output_tokens`	int	Output budget after enforcement
`reasoning_effort`	text	Reasoning effort for o-series models
`prompt_cache_key`	text	Stable key forwarded to upstream for cache routing
`allow_shell`	boolean	Policy shell setting
`shell_requested`	boolean	Did the client request a shell tool?
`shell_denied`	boolean	Was the shell request blocked?
`http_status`	int	Response status code
`upstream_request_id`	text	OpenAI's request ID (for support)
`latency_ms_total`	int	Total round-trip latency
`latency_ms_api`	int	OpenAI API call latency
`retry_429_count`	int	Number of 429 retries
`error_code`	text	Error code (null if success)

Usage row (per request):

Field	Type	Description
`input_tokens`	int	Total input tokens
`output_tokens`	int	Total output tokens
`cached_tokens`	int	Input tokens served from cache
`uncached_tokens`	int (generated)	`input_tokens − cached_tokens`
`reasoning_tokens`	int	Hidden reasoning tokens billed on o-series / GPT-5 models
`cache_source`	text	`upstream` (OpenAI), `proxy`, or `engine` — where the cache hit came from
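The generated column and a derived cache hit ratio can be modeled in a few lines. The `UsageRow` class and the `cache_hit_ratio` property are hypothetical conveniences for this sketch; only the field names and the `input_tokens − cached_tokens` relationship come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageRow:
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    reasoning_tokens: int = 0
    cache_source: Optional[str] = None  # "upstream", "proxy", or "engine"

    @property
    def uncached_tokens(self) -> int:
        """Mirror of the generated column: input_tokens - cached_tokens."""
        return self.input_tokens - self.cached_tokens

    @property
    def cache_hit_ratio(self) -> float:
        """Share of input tokens served from cache (0.0 when there is no input)."""
        return self.cached_tokens / self.input_tokens if self.input_tokens else 0.0
```

For instance, a row with 1,000 input tokens of which 750 were cached has 250 uncached tokens and a hit ratio of 0.75.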

🚀 Deployment

The proxy runs on our infrastructure (BYOK). You bring your own OpenAI or Anthropic API key, and it passes through untouched. By default we store no prompts or outputs; only content-free telemetry metadata is persisted.

The CachePilot proxy terminates TLS via Caddy. Your API key never touches our storage layer; it is forwarded to the upstream provider in memory during each request.

Component	Stack
Proxy	Node.js / Express, Docker
TLS	Caddy (auto Let's Encrypt)
Database	Neon Postgres (serverless)
Dashboard	Next.js on Vercel

Endpoint: https://api.cachepilot.clclabs.ai/v1/responses

Health check: https://api.cachepilot.clclabs.ai/health