How Comis cuts LLM costs by 81%
An 8-layer context engine, a cache fence system, sub-agent spawn staggering, progressive tool disclosure, and more - 20 mechanisms in total, so your agents run indefinitely with predictable costs.
17x
Cache read/write ratio
94%
Tokens served from cache
93%
Writes are cold-start only
$0.50
Per MTok cached reads
The problem
Context windows are expensive.
System prompts are large
Identity, instructions, tool definitions, workspace files, and security guardrails easily reach 30-60K tokens.
Conversations grow fast
Tool results (file reads, web fetches, API responses) can be 10-50K tokens each. A 30-turn session with tool use can cost $5-15.
Multi-agent pipelines multiply costs
A 7-agent stock analysis pipeline costs 7x the base rate per message. Without optimization, this becomes prohibitive.
Cache misses are silent killers
Anthropic charges 10-20x more for cache writes ($6.25-10/MTok) than reads ($0.50/MTok). A single modified message in the cached prefix invalidates the entire KV cache.
The solution
8-layer context engine pipeline
Every conversation is processed through a composable pipeline of 8 layers before each LLM call. Each layer targets a specific source of token waste.
Every layer has a circuit breaker - 3 consecutive failures disables the layer. No single optimization bug can bring down the pipeline.
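A per-layer circuit breaker of this kind can be sketched as follows. This is a hedged illustration of the described behavior, not Comis's actual code; the class name and layer signature are hypothetical.

```typescript
type Layer = (msgs: string[]) => string[];

// Three consecutive failures disable a layer; any success resets the count.
// A disabled (or failing) layer fails open: the input passes through unchanged.
class GuardedLayer {
  private consecutiveFailures = 0;
  private disabled = false;

  constructor(private name: string, private layer: Layer, private maxFailures = 3) {}

  apply(msgs: string[]): string[] {
    if (this.disabled) return msgs;           // tripped: pure pass-through
    try {
      const out = this.layer(msgs);
      this.consecutiveFailures = 0;           // success resets the breaker
      return out;
    } catch {
      this.consecutiveFailures++;
      if (this.consecutiveFailures >= this.maxFailures) this.disabled = true;
      return msgs;                            // fail open on error
    }
  }

  get isDisabled() { return this.disabled; }
}
```

Because each layer fails open, a bug in one optimization degrades to a no-op instead of blocking the LLM call.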
Thinking Block Cleaner
Every call
Strips extended thinking traces from older turns (default: 10 turns), keeping recent reasoning while reclaiming tokens from stale deliberation.
Reasoning Tag Stripper
Every call
Strips inline reasoning tags (<think>, <thinking>, <thought>, <antThinking>) from non-Anthropic provider responses persisted in session history. Always active regardless of the current model's reasoning capabilities, since sessions may contain messages from multiple providers.
History Window
Every call
Caps conversation history to the last N user turns (default: 15, configurable per channel). Pair-safe: never splits a tool-call/tool-result pair. Compaction summaries are always preserved as anchors.
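A minimal sketch of a pair-safe history window. This is an illustration, not Comis's implementation: it assumes that cutting only at user-turn boundaries is what keeps tool-call/tool-result pairs intact, and models summary anchors with a hypothetical `summary` flag.

```typescript
type Msg = { role: "user" | "assistant" | "tool_result"; text: string; summary?: boolean };

// Cut only at user-turn boundaries, so an assistant tool call is never
// separated from the tool_result that follows it; compaction summaries from
// before the cut are carried forward as anchors.
function windowHistory(msgs: Msg[], keepUserTurns: number): Msg[] {
  let seen = 0, cut = 0;
  for (let i = msgs.length - 1; i >= 0; i--) {
    if (msgs[i].role === "user" && ++seen === keepUserTurns) { cut = i; break; }
  }
  const anchors = msgs.slice(0, cut).filter(m => m.summary); // preserved summaries
  return [...anchors, ...msgs.slice(cut)];
}
```

If fewer than N user turns exist, the whole history is kept unchanged.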
Dead Content Evictor
Every call
Uses forward-index O(n) analysis to detect provably superseded tool results. If an agent read a file at turn 5 and again at turn 20, the turn-5 result is replaced with a 50-byte placeholder. Tracks 5 categories: file_read, exec, web, image, error.
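The forward-index idea can be sketched in a few lines. This is a simplified model, not the real evictor: the `ToolResult` shape and placeholder text are hypothetical.

```typescript
type ToolResult = { tool: string; key: string; content: string };

// One forward pass records the latest index per (tool, key); every earlier
// occurrence is provably superseded and collapses to a short placeholder.
function evictSuperseded(results: ToolResult[]): ToolResult[] {
  const latest = new Map<string, number>();
  results.forEach((r, i) => latest.set(`${r.tool}:${r.key}`, i)); // O(n) forward index
  return results.map((r, i) =>
    latest.get(`${r.tool}:${r.key}`) === i
      ? r
      : { ...r, content: `[superseded ${r.tool}: ${r.key}]` }
  );
}
```

Only the latest read of each key survives at full size; everything else costs a few dozen bytes.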
Observation Masker
Context > 120K chars
Three-tier masking when total context exceeds 120K characters. Protected tools (memory, file reads) are never masked. Standard tools use a keep window (default: the 25 most recent). Ephemeral tools (web searches, fetches) get a shorter keep window (default: 10). Masked entries persist to disk for stable cache prefixes. Hysteresis prevents oscillation: masking activates at 120K and deactivates below 80K.
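The hysteresis behavior is worth seeing concretely. A minimal sketch, assuming the two thresholds quoted above; the class and method names are illustrative.

```typescript
// Masking activates above 120K chars and only deactivates below 80K; in the
// 80-120K band the previous state is kept, so the layer cannot flap on and
// off as tool results push the context back and forth across one threshold.
class MaskerHysteresis {
  private active = false;
  constructor(private activateAt = 120_000, private deactivateAt = 80_000) {}

  update(contextChars: number): boolean {
    if (!this.active && contextChars > this.activateAt) this.active = true;
    else if (this.active && contextChars < this.deactivateAt) this.active = false;
    return this.active;
  }
}
```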
LLM Compaction
Context > 85% of window
Last-resort compression when context exceeds 85% of the model window. Compresses 50+ messages into a structured 9-section summary using a cheaper model (Haiku by default). Three-tier fallback: full summarization, then filtered summarization, then a count-only note.
Rehydration
After compaction
Strategically re-injects only what was lost: workspace instructions (AGENTS.md, max 3K chars), recently accessed files (max 5 files, 8K chars each), and a resume instruction for seamless continuation.
Objective Reinforcement
After compaction
For sub-agents, re-injects the original task objective so delegated tasks stay on track even through context compression.
Prompt caching
Cache-stable prompt architecture
Anthropic's prompt caching cuts costs 7.5x - but only if the system prompt stays identical across turns. Most frameworks embed timestamps or channel metadata in the system prompt, silently invalidating the cache on every message.
| | System Prompt | Dynamic Preamble |
|---|---|---|
| Content | Identity, personality, workspace files, tool definitions, security rules | Timestamp, sender metadata, channel context, RAG results, active skills, trust entries |
| Cache behavior | Cached - paid once at write rate, then $0.50/MTok on every subsequent call | Prepended to user message - never invalidates the cache prefix |
Six categories kept out of the system prompt
Date/time, inbound message metadata, channel context, RAG memory results, active skill content, and sender trust entries. Each would invalidate the entire cache prefix if left inline (CACHE-01 through CACHE-06).
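The split is easy to picture in code. A minimal sketch, assuming a hypothetical preamble format; the field names here are illustrative, not Comis's actual wire format.

```typescript
type Turn = { system: string; user: string };

// The static system prompt is byte-identical across turns; all volatile
// categories travel in a preamble prepended to the user message instead.
function buildTurn(
  staticSystem: string,
  userText: string,
  dynamic: { timestamp: string; channel: string; ragResults: string[] },
): Turn {
  const preamble = [
    `[time: ${dynamic.timestamp}]`,
    `[channel: ${dynamic.channel}]`,
    ...dynamic.ragResults.map(r => `[memory: ${r}]`),
  ].join("\n");
  return { system: staticSystem, user: `${preamble}\n\n${userText}` };
}
```

Because `system` never changes between calls, the cached prefix stays valid no matter how the dynamic fields vary.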
Cache fence system
Protecting the cached prefix across turns
Without protection, context engine layers modify messages within the cached region - stripping thinking blocks, masking tool results, evicting dead content. Each modification invalidates Anthropic's KV cache, forcing expensive re-writes. The cache fence prevents this.
Active cache optimization, not passive configuration
Most agent platforms treat caching as a provider feature to configure - a single cache_control tag on the system message. Comis treats the Anthropic cache as a target architecture to optimize for. The context engine pipeline is cache-aware: it knows where the cache boundary is and actively prevents modifications within the cached region.
breakpoint callback → fence index → layer guards → trim offset translation → back into the next call's breakpoint callback (state persisted across execute() calls)

Multiple feedback loops ensure cache stability across turns, achieving a 16.9x read/write ratio in production - meaning for every token written to cache, 17 tokens are served from it at 10x lower cost.
Cache fence feedback loop
Cache Fence Index
Tracks the highest cache breakpoint position from each LLM call. On the next turn, context engine layers skip all messages at or below this index - preventing modifications that would invalidate the cached prefix.
Eliminated 36K+ tokens of unnecessary cache invalidation per pipeline turn
Trim Offset Translation
The history window trims 100+ messages from the front of the session, shifting all indices. The fence index is stored in pre-trim space and correctly adjusted after trimming so protection survives across turns.
Fixed fence being zeroed out by 97-120 message trims
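The fence guard and the trim translation fit in a few lines. A sketch under the assumptions stated above (fence recorded in pre-trim coordinates, -1 meaning "no fence"); function names are illustrative.

```typescript
// The fence index is recorded before trimming. After the history window
// removes messages from the front, the index shifts by the trim count, or
// clears entirely if the fenced region was trimmed away.
function translateFence(fenceIndex: number, trimmedCount: number): number {
  if (fenceIndex < 0) return -1;            // no fence recorded yet
  const adjusted = fenceIndex - trimmedCount;
  return adjusted >= 0 ? adjusted : -1;     // fence fell off the front
}

// Context engine layers consult this before touching any message.
function isProtected(messageIndex: number, fenceIndex: number): boolean {
  return fenceIndex >= 0 && messageIndex <= fenceIndex;
}
```

Without the translation, a 100-message trim would leave the fence pointing at the wrong message, or zero it out entirely.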
Sub-Agent Spawn Staggering
Concurrent sub-agents in a pipeline wave are staggered by 4 seconds. The first sub-agent populates the shared cache prefix (system prompt + tools), then siblings read it instead of each paying the write cost.
49K avg cache reads on sub-agent first turn (shared prefix)
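Staggered spawning reduces to delaying each sibling by a multiple of the stagger interval. A minimal async sketch; the function name and task shape are hypothetical.

```typescript
// The first task starts immediately and pays the one-time cache write; each
// sibling is delayed so it finds the shared prefix already warm and reads it.
async function spawnStaggered<T>(
  tasks: Array<() => Promise<T>>,
  staggerMs = 4000,
): Promise<T[]> {
  const delay = (ms: number) => new Promise<void>(res => setTimeout(res, ms));
  return Promise.all(
    tasks.map(async (task, i) => {
      await delay(i * staggerMs); // agent 0: no delay; agent 1: 4s; agent 2: 8s...
      return task();
    }),
  );
}
```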
Adaptive Cold-Start Retention
Parent agents write at 1h TTL from the first call so the system prompt survives pipeline gaps (>5m). Sub-agents inherit mixed 5m/1h TTLs - shared prefix blocks use the first writer's TTL, conversation-specific content uses 5m. Cache refreshes don't upgrade TTL, so the initial write determines the cache lifetime.
Eliminates re-writes per pipeline gap. 81% cost reduction at steady state.
TTL Monotonicity Enforcement
Anthropic's API requires cache breakpoint TTLs to be non-increasing (system >= tools >= messages). The SDK sets system to 5m, but tool breakpoints escalate to 1h. Comis upgrades system block TTLs in the onPayload hook to satisfy the constraint, preventing silent downgrades.
Unlocked 1h cache writes for the first time. Without this, all 1h TTLs were silently downgraded to 5m.
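The monotonicity fix amounts to an upgrade pass over the cache breakpoints before the request is sent. A simplified sketch of an onPayload-style hook; the block shape here is an assumption, not the SDK's actual payload format.

```typescript
type Ttl = "5m" | "1h";
type CacheBlock = { kind: "system" | "tool" | "message"; ttl: Ttl };

// Anthropic requires breakpoint TTLs to be non-increasing from system through
// tools to messages. If any later breakpoint requests 1h, the system block is
// upgraded to 1h so the 1h write is accepted instead of silently downgraded.
function enforceTtlMonotonicity(blocks: CacheBlock[]): CacheBlock[] {
  const needsLongSystem = blocks.some(b => b.kind !== "system" && b.ttl === "1h");
  return blocks.map(b =>
    b.kind === "system" && needsLongSystem && b.ttl === "5m" ? { ...b, ttl: "1h" } : b,
  );
}
```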
Breakpoint Threshold Tuning
Sub-agents use 512-token minimum for breakpoint placement. Parent agents use 1,024 tokens (lowered from 4,096). Enables message-level cache breakpoints in conversational exchanges where individual messages are 500-2000 tokens.
Sub-agent cache ratio improved from 3.0x to 4.7x. Parent agent gained 1 message breakpoint per call (was 0).
Cache diagnostics
Cache break detection and attribution
When cache invalidation happens, knowing why is critical. Comis uses a two-phase detection system that records pre-call state and performs post-call analysis to attribute every cache break to its root cause.
Phase 1: Pre-call snapshot
Records SHA-256 hashes of system prompt, tool schemas, and cache_control metadata before the LLM call. These fingerprints establish the baseline for comparison.
Phase 2: Post-call analysis
After the API response, compares actual cache write tokens against expected values. Dual threshold: >5% relative AND >2K absolute tokens triggers attribution.
Attribution priority chain
When a cache break is detected, the system walks a priority chain - comparing the recorded system prompt, tool schema, and cache_control fingerprints in order - to identify the root cause.
Lazy content diffing: full serialization only runs when a break is actually detected, keeping hot-path overhead near zero.
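The cheap pre-check that gates all of this can be sketched directly from the thresholds quoted above. A hedged illustration; the function name and defaults-as-parameters shape are mine.

```typescript
// A break fires only when the observed cache write exceeds the expected write
// by more than 5% relative AND more than 2K absolute tokens. Attribution and
// the expensive content diff run only after this check passes.
function isCacheBreak(
  expectedWriteTokens: number,
  actualWriteTokens: number,
  relThreshold = 0.05,
  absThreshold = 2000,
): boolean {
  const excess = actualWriteTokens - expectedWriteTokens;
  if (excess <= 0) return false;
  const relative = expectedWriteTokens > 0 ? excess / expectedWriteTokens : Infinity;
  return relative > relThreshold && excess > absThreshold;
}
```

Requiring both thresholds means small sessions don't trip on percentage noise and large sessions don't trip on a few thousand incidental tokens.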
Multi-provider caching
Gemini explicit caching via CachedContent API
Beyond Anthropic's prompt cache, Comis implements explicit caching for Google's Gemini models using the CachedContent API - a guaranteed 90% discount on cached tokens.
Cache Manager
Full CachedContent lifecycle: create, reuse, refresh, and dispose. SHA-256 content hashing detects changes. Concurrent request deduplication prevents duplicate cache entries.
Per-model thresholds
Gemini Flash requires 1,024 minimum cacheable tokens. Gemini Pro requires 4,096. Below-minimum requests fall through to uncached calls automatically.
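The fall-through logic is a one-line threshold check. A sketch using the minimums from the text; the model IDs and function name are illustrative, not the Gemini API's actual identifiers.

```typescript
// Minimum cacheable prompt sizes per model family (values from the text).
const MIN_CACHEABLE_TOKENS: Record<string, number> = {
  "gemini-flash": 1024,
  "gemini-pro": 4096,
};

// Below the minimum (or for unknown models), skip CachedContent entirely
// and make a plain uncached call.
function shouldUseCachedContent(model: string, promptTokens: number): boolean {
  const min = MIN_CACHEABLE_TOKENS[model];
  return min !== undefined && promptTokens >= min;
}
```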
Lifecycle management
Session expiry events trigger cache disposal. Orphaned caches (from crashes or disconnects) are detected and cleaned up automatically.
Provider isolation
Anthropic's requestBodyInjector and Google's geminiCacheInjector are mutually exclusive at runtime via provider guards. The Gemini injector atomically strips inherited fields (systemInstruction, tools, toolConfig) when a CachedContent name is present, since these are already baked into the cache.
Pipeline caching
How 8 agents share a single cache prefix
In a multi-agent pipeline, each sub-agent shares the same system prompt and tool definitions. Without staggering, all agents would write the same content simultaneously - paying the write cost 8 times. With staggered spawning, the first agent writes once and the rest read at 10x lower cost.
Staggered sub-agent spawning (4s intervals)
Without staggering
$22.51
4M prompt tokens at input rate. No cache sharing. Each agent pays full write cost for identical system prompts.
With Comis cache system
$12.48
4.9M tokens from cache reads at $0.50/MTok. First agent writes, siblings read. Fence prevents invalidation. 45% cheaper on this run, rising to 81% at steady state.
Anthropic pricing
Why cache reads matter
Cache reads are 10-20x cheaper than writes and 10x cheaper than base input. Every token shifted from write to read directly reduces cost.
| Model | Input | 5m Write | 1h Write | Cache Read | Saving |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 10x cheaper |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 10x cheaper |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 10x cheaper |
Prices per million tokens. Source: Anthropic pricing page, June 2025.
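A worked example from the Opus row makes the economics concrete. This is a deliberately simplified blend - it prices every uncached token at the 5m write rate and ignores output tokens - so treat the result as an approximation, not Comis's accounting.

```typescript
// Opus 4.6 rates from the table above, in $ per MTok.
const INPUT = 5.0, WRITE_5M = 6.25, READ = 0.5;

// Blended prompt cost per MTok at a given cache hit rate: uncached tokens
// pay the 5m write rate, cached tokens pay the read rate.
function costPerMTok(cachedFraction: number): number {
  return (1 - cachedFraction) * WRITE_5M + cachedFraction * READ;
}
```

At the 94% cache hit rate quoted at the top of this page, the blend comes to about $0.85/MTok versus $5.00 at the base input rate - an ~83% reduction, in line with the 81% headline figure.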
Write-time optimization
Microcompaction
Tool results are intercepted at write time. Oversized results are offloaded to disk and replaced with lightweight references. The agent can re-read if needed.
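The write-time check can be sketched with the thresholds this section lists. The offload callback (persisting to disk and returning a reference id) is left abstract, and the placeholder format is hypothetical.

```typescript
// Per-tool character limits from this section, plus the hard cap.
const CHAR_LIMITS: Record<string, number> = {
  file_read: 15_000,
  mcp: 15_000,
  default: 8_000,
};
const HARD_CAP = 100_000;

// Oversized results are persisted via `offload` and replaced with a
// lightweight reference; the agent can re-read the full content on demand.
function microcompact(
  tool: string,
  result: string,
  offload: (content: string) => string,
): string {
  const limit = Math.min(CHAR_LIMITS[tool] ?? CHAR_LIMITS.default, HARD_CAP);
  if (result.length <= limit) return result;
  const ref = offload(result);
  return `[offloaded ${result.length} chars; re-read via ref ${ref}]`;
}
```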
| Result type | Offload threshold |
|---|---|
| file_read | 15,000 chars |
| MCP tools | 15,000 chars |
| Default tools | 8,000 chars |
| Hard cap | 100,000 chars |

Tool optimization
Lean tool definitions
Most frameworks send verbose tool descriptions on every turn - usage guides, action lists, workspace paths. Comis uses lean, structured descriptions optimized for tool selection, with detailed guidance delivered just-in-time on first use.
Lean contracts - every tool gets a structured description under 150 characters, optimized for selection accuracy.
Just-in-time guides - detailed usage instructions delivered in the tool response on first use, not loaded upfront.
Progressive disclosure
Three-tier tool context management
With 50+ built-in tools and unlimited MCP integrations, tool definitions can consume more context than the conversation itself. Comis uses a three-tier system that loads tool information progressively based on relevance.
Research shows fewer, better-described tools outperform large tool sets. Disambiguation between similar tools yields 8-38% accuracy gains.
Lean definitions
Every tool gets a short, structured description optimized for selection - not usage. Confusable tool pairs include explicit disambiguation suffixes so the model picks the right one.
Just-in-time guidance
Detailed usage instructions, workspace guides, and action requirements are injected into the tool response the first time a tool is called. Delivered exactly when needed, paid only once per session.
Deferred loading
Tools irrelevant to the current context - wrong platform, insufficient trust, rarely used - are deferred behind a semantic discovery tool. Admin tools defer for non-admin sessions. Discord tools defer on Telegram.
Model-aware tool presentation
Small models (8B-70B) get aggressive deferral, pruned parameter schemas, and a focused core tool set. Large models get full schemas with lean descriptions. The model tier is resolved automatically from the context window size - no configuration needed.
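A sketch of the tier decision. The 32K cutoff is purely illustrative - the text only says the tier is derived from the context window size - and the "pruned schema" is modeled as an empty object.

```typescript
type Tool = { name: string; core: boolean; schema: object };

// Small-tier models get only core tools with pruned parameter schemas;
// large-tier models get the full set with full schemas.
function presentTools(tools: Tool[], contextWindowTokens: number): Tool[] {
  const small = contextWindowTokens < 32_000; // illustrative cutoff
  return small
    ? tools.filter(t => t.core).map(t => ({ ...t, schema: {} }))
    : tools;
}
```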
Cost control
Three-tier budget guard
Every agent has three budget caps, checked before the LLM call - not after the money is spent.
200K tokens
Per-execution
Prevent runaway single calls
1M tokens
Per-hour
Rate limiting
5M tokens
Per-day
Daily cost ceiling
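The three caps above reduce to a single pre-call check. A sketch using the default values listed; in the real system the `usage` counters would come from rolling per-hour and per-day accounting.

```typescript
type Caps = { perExecution: number; perHour: number; perDay: number };

// Defaults from the cards above.
const DEFAULT_CAPS: Caps = { perExecution: 200_000, perHour: 1_000_000, perDay: 5_000_000 };

// Checked BEFORE the LLM call: if the estimate would breach any cap,
// the call is refused and no money is spent.
function checkBudget(
  estimatedTokens: number,
  usage: { hour: number; day: number },
  caps: Caps = DEFAULT_CAPS,
): { ok: boolean; reason?: string } {
  if (estimatedTokens > caps.perExecution) return { ok: false, reason: "per-execution cap" };
  if (usage.hour + estimatedTokens > caps.perHour) return { ok: false, reason: "hourly cap" };
  if (usage.day + estimatedTokens > caps.perDay) return { ok: false, reason: "daily cap" };
  return { ok: true };
}
```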
Production data
Real-world cost profile
From a production session running an 8-agent stock analysis pipeline with Claude Opus 4.6 (TradingAgents: 4 analysts, bull-bear debate, trader, risk management, portfolio manager).
| Operation | Cost | Notes |
|---|---|---|
| First "Hello" | $0.25 | Cold cache - full system prompt write (Opus) |
| Second message | $0.03 | Warm cache - fence protects prefix |
| Subsequent messages | $0.04-0.09 | Stable cache reads, fence active |
| 8-agent trading pipeline | ~$2.11 | 4 analysts + debate + trader + risk + PM (788K tokens, 70% cache effectiveness) |
| Pipeline without optimizations | ~$22.51 | Same pipeline, no caching |
Without context management
The same 8-agent pipeline without caching costs $22.51 - 4M prompt tokens at full input rate. With Comis cache system: $12.48. The cache fence alone prevented 36K+ tokens of unnecessary re-writes per pipeline turn by protecting the cached prefix from context engine modifications.
Observability
Every token accounted for
Every pipeline run logs structured metrics. Events fire on every significant action, feeding into the observability dashboard for real-time cost monitoring.
{
"tokensLoaded": 40112,
"tokensEvicted": 0,
"tokensMasked": 0,
"tokensCompacted": 0,
"thinkingBlocksRemoved": 0,
"budgetUtilization": 0.05,
"cacheFenceIndex": 54,
"sessionDepth": 188,
"sessionToolResults": 50,
"durationMs": 1
}

cacheFenceIndex: 54 means messages 0-54 are protected from modification. thinkingBlocksRemoved: 0 confirms the fence is preventing unnecessary stripping. Budget utilization at 5% with 188-message depth shows the history window and context layers are keeping costs controlled even in long sessions.
The full picture.
20 mechanisms working together to keep costs predictable at any scale.
| Mechanism | What it saves | When it fires |
|---|---|---|
| Thinking cleaner | Old reasoning traces | Every call (layer 1) |
| Reasoning tag stripper | Non-Anthropic inline reasoning tags | Every call (layer 2) |
| History window | Old conversation turns | Every call (layer 3) |
| Dead content evictor | Superseded file reads, exec results | Every call (layer 4) |
| Observation masker | Old tool outputs (3-tier: protected/standard/ephemeral) | Context > 120K chars (layer 5) |
| LLM compaction | Entire conversation → 9-section summary | Context > 85% of window (layer 6) |
| Rehydration | Re-injects critical context | After compaction (layer 7) |
| Objective reinforcement | Sub-agent goals survive compaction | After compaction (layer 8) |
| Cache fence | Prevents cache prefix invalidation | Every call (all layers) |
| Trim offset translation | Fence survives history-window | After layer 3 |
| Spawn staggering | Shared cache prefix across sub-agents | Pipeline wave start |
| Tiered cache retention | 5m for sub-agents, 1h for parent | Per-session |
| Cache-stable prompts | Dynamic fields out of system prompt | Every call after first |
| Cache break detection | Detects and attributes cache invalidation causes | Post-LLM call |
| Gemini explicit caching | CachedContent API with SHA-256 hashing | Gemini providers |
| Prefix instability detection | Forces short TTL on stuck cache reads | Per-session |
| Microcompaction | Large tool results offloaded | At write time |
| Re-read detector | Identifies duplicate tool calls | Every call |
| Lean tool definitions | 88% of tool description tokens | Every call |
| Just-in-time guidance | Upfront instructional overhead | On first tool use |
| Tool deferral | Irrelevant tool schemas | Context-dependent |
| Model-aware presentation | Schema overhead on small models | Auto by model tier |
| Budget guard | Prevents runaway calls | Pre-call estimation |
Real production data
Anthropic dashboard: April 11, 2026
Actual token usage and costs from a production Comis instance running multi-agent pipelines on Claude Opus 4.6. Not a benchmark - real user traffic.
Token breakdown
Cost breakdown (Opus 4.6)
16.9x
Read/write ratio
81%
Cost savings
$26.42
Without caching
Production agents with predictable costs.
Run agents indefinitely, across thousands of turns, without the exponential token growth that makes naive frameworks unusable at scale.