Day 3 of 16 · Tuesday · Learning

Token economicsCapability

What actually costs money when you run Claude — output tokens, model tier, and whether your prompt cache is still warm — and how to spot the overspend hiding in your own scheduled jobs.

Catch-up progress

3/16

Why this matters to you

Every morning a stack of your LaunchAgents fire `claude -p` cold: plan-day, classify, the language + learning lessons. Each one re-reads your entire CLAUDE.md context from scratch — a guaranteed cache miss — and some of them run Opus on work a much cheaper model would nail identically. That's real money leaking on a timer.

You pay per token, both directions: tokens going in (your CLAUDE.md, the skill file, the input data) and tokens coming out (the response). The catch most people miss is that output is roughly 5× the price of input. So a tight prompt that produces a rambling answer costs more than a long prompt that produces a crisp one — the lever is the response length, not the instructions.

The second lever is the prompt cache. Anthropic will cache your big static context (the CLAUDE.md, the skill) and a cache hit costs about a tenth of a cache miss. But the cache lives for only ~5 minutes. Back-to-back turns in a live session ride a warm cache and stay cheap; a job that sleeps longer than 5 minutes between steps — or a cron that spawns a brand-new process every morning — re-reads everything at full price. This is the exact reason the `/loop` and ScheduleWakeup tooling warns you off 300-second sleeps: just past the cache window is the worst place to wait.

The third lever is model tier. Opus is your most capable and most expensive model; Haiku is a fraction of the cost. For judgment-heavy work (writing your day's narrative, composing a lesson) Opus earns its keep. For deterministic sorting — like the classify step that buckets a dropped file into S/M/L tier — a smaller model produces the same answer for a fifteenth of the price. Running Opus on a sort is the most common quiet overspend in your whole system.

Worked example

Find out what your scheduled jobs actually cost to load, before they do a single useful thing:

# Which model does each scheduled claude -p job run?
grep -ri "claude-\|--model\|opus\|haiku" /Users/tom/Claude/PDB/scripts/run_plan_day.sh

# Rough card rates (per million tokens):
#   Opus  ~ $15 in / $75 out
#   Haiku ~ $1  in / $5  out
#
# A plan-day run loads ~40k tokens of context (CLAUDE.md + skill + inputs)
# every morning, cold cache, before writing anything:
#   40,000 / 1,000,000 * $15  =  $0.60 just to read the room
#
# The same 40k on a warm cache (a follow-up turn within 5 min):
#   ~ $0.06  — one tenth.

The grep shows whether the job pins a model or inherits the default — if plan-day is on Opus, that 40k load is $0.60 of input every single morning before it produces a word.
The two rate lines are the whole pricing story: output is the expensive direction, and the gap between Opus and Haiku is ~15×.
The cold-vs-warm comparison is the cache tax — the same tokens cost 10× more the moment the 5-minute window lapses, which a once-a-morning cron always pays.

▶ Do it now

Run `/benchmark-models` on one prompt you fire daily — the classify prompt is a great pick. It runs the same input through Claude, GPT, and Gemini and reports latency, token counts, and cost side by side.
Read the input-token count it reports, multiply by the Opus input rate ($15/M), and again by the Haiku rate ($1/M). That difference is your per-run overspend if the task doesn't need Opus-level judgment.
Open the classify cron script and find the model flag. If it's running a deterministic tier-sort on the big model, that's the single highest-leverage swap in your system — note it as a one-line change and watch the next classify run produce the identical card for a fraction of the cost.

Gotchas

The cache TTL is wall-clock 5 minutes, not per-session — a `/loop` that sleeps 6 minutes pays full freight on every wake, so either stay under ~270s or commit to a much longer wait that amortizes the miss.
A verbose system prompt is cheap (input, cached); a verbose response is not (output, never cached). Trim what the model says before you trim what you tell it.
"Bigger model is safer" is false for deterministic work — reserve Opus for tasks where judgment varies the output, not where the answer is mechanical.

Go deeper: Anthropic — API pricing · Anthropic — Prompt caching docs

One-card takeaway

The two cheapest wins are picking the smallest model that still does the job and keeping the cache warm — both beat any amount of prompt-trimming.

← Prev Index Next →