How Comis cuts LLM costs by 81%
An 8-layer context engine, a cache fence system, sub-agent spawn staggering, progressive tool disclosure, and more - 20 mechanisms in total, so your agents run indefinitely with predictable costs.
17x
Cache read/write ratio
94%
Tokens served from cache
93%
Writes are cold-start only
$0.50
Per MTok cached reads
The problem
Context windows are expensive.
System prompts are large
Identity, instructions, tool definitions, workspace files, and security guardrails easily reach 30-60K tokens.
Conversations grow fast
Tool results (file reads, web fetches, API responses) can be 10-50K tokens each. A 30-turn session with tool use can cost $5-15.
Multi-agent pipelines multiply costs
A 7-agent stock analysis pipeline costs 7x the base rate per message. Without optimization, this becomes prohibitive.
Cache misses are silent killers
Anthropic charges 10-20x more for cache writes ($6.25-10/MTok) than reads ($0.50/MTok). A single modified message in the cached prefix invalidates the entire KV cache.
The solution
8-layer context engine pipeline
Every conversation is processed through a composable pipeline of 8 layers before each LLM call. Each layer targets a specific source of token waste.
Every layer has a circuit breaker - 3 consecutive failures disables the layer. No single optimization bug can bring down the pipeline.
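A per-layer circuit breaker of this kind can be sketched as follows. This is a hedged illustration of the described behavior, not Comis's actual code; the class name and layer signature are hypothetical.

```typescript
type Layer = (msgs: string[]) => string[];

// Three consecutive failures disable a layer; any success resets the count.
// A disabled (or failing) layer fails open: the input passes through unchanged.
class GuardedLayer {
  private consecutiveFailures = 0;
  private disabled = false;

  constructor(private name: string, private layer: Layer, private maxFailures = 3) {}

  apply(msgs: string[]): string[] {
    if (this.disabled) return msgs;           // tripped: pure pass-through
    try {
      const out = this.layer(msgs);
      this.consecutiveFailures = 0;           // success resets the breaker
      return out;
    } catch {
      this.consecutiveFailures++;
      if (this.consecutiveFailures >= this.maxFailures) this.disabled = true;
      return msgs;                            // fail open on error
    }
  }

  get isDisabled() { return this.disabled; }
}
```

Because each layer fails open, a bug in one optimization degrades to a no-op instead of blocking the LLM call.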
Thinking Block Cleaner
Every call
Strips extended thinking traces from older turns (default: 10 turns), keeping recent reasoning while reclaiming tokens from stale deliberation.
Reasoning Tag Stripper
Every call
Strips inline reasoning tags (<think>, <thinking>, <thought>, <antThinking>) from non-Anthropic provider responses persisted in session history. Always active regardless of the current model's reasoning capabilities, since sessions may contain messages from multiple providers.
History Window
Every call
Caps conversation history to the last N user turns (default: 15, configurable per channel). Pair-safe: never splits a tool-call/tool-result pair. Compaction summaries are always preserved as anchors.
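A minimal sketch of a pair-safe history window. This is an illustration, not Comis's implementation: it assumes that cutting only at user-turn boundaries is what keeps tool-call/tool-result pairs intact, and models summary anchors with a hypothetical `summary` flag.

```typescript
type Msg = { role: "user" | "assistant" | "tool_result"; text: string; summary?: boolean };

// Cut only at user-turn boundaries, so an assistant tool call is never
// separated from the tool_result that follows it; compaction summaries from
// before the cut are carried forward as anchors.
function windowHistory(msgs: Msg[], keepUserTurns: number): Msg[] {
  let seen = 0, cut = 0;
  for (let i = msgs.length - 1; i >= 0; i--) {
    if (msgs[i].role === "user" && ++seen === keepUserTurns) { cut = i; break; }
  }
  const anchors = msgs.slice(0, cut).filter(m => m.summary); // preserved summaries
  return [...anchors, ...msgs.slice(cut)];
}
```

If fewer than N user turns exist, the whole history is kept unchanged.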
Dead Content Evictor
Every call
Uses forward-index O(n) analysis to detect provably superseded tool results. If an agent read a file at turn 5 and again at turn 20, the turn-5 result is replaced with a 50-byte placeholder. Tracks 5 categories: file_read, exec, web, image, error.
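The forward-index idea can be sketched in a few lines. This is a simplified model, not the real evictor: the `ToolResult` shape and placeholder text are hypothetical.

```typescript
type ToolResult = { tool: string; key: string; content: string };

// One forward pass records the latest index per (tool, key); every earlier
// occurrence is provably superseded and collapses to a short placeholder.
function evictSuperseded(results: ToolResult[]): ToolResult[] {
  const latest = new Map<string, number>();
  results.forEach((r, i) => latest.set(`${r.tool}:${r.key}`, i)); // O(n) forward index
  return results.map((r, i) =>
    latest.get(`${r.tool}:${r.key}`) === i
      ? r
      : { ...r, content: `[superseded ${r.tool}: ${r.key}]` }
  );
}
```

Only the latest read of each key survives at full size; everything else costs a few dozen bytes.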
Observation Masker
Context > 120K chars
Three-tier masking when total context exceeds 120K characters. Protected tools (memory, file reads) are never masked. Standard tools use a keep window (default: the 25 most recent). Ephemeral tools (web searches, fetches) get a shorter keep window (default: 10). Masked entries persist to disk for stable cache prefixes. Hysteresis prevents oscillation: masking activates at 120K and deactivates below 80K.
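The hysteresis behavior is worth seeing concretely. A minimal sketch, assuming the two thresholds quoted above; the class and method names are illustrative.

```typescript
// Masking activates above 120K chars and only deactivates below 80K; in the
// 80-120K band the previous state is kept, so the layer cannot flap on and
// off as tool results push the context back and forth across one threshold.
class MaskerHysteresis {
  private active = false;
  constructor(private activateAt = 120_000, private deactivateAt = 80_000) {}

  update(contextChars: number): boolean {
    if (!this.active && contextChars > this.activateAt) this.active = true;
    else if (this.active && contextChars < this.deactivateAt) this.active = false;
    return this.active;
  }
}
```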
LLM Compaction
Context > 85% of window
Last-resort compression when context exceeds 85% of the model window. Compresses 50+ messages into a structured 9-section summary using a cheaper model (Haiku by default). Three-tier fallback: full summarization, then filtered summarization, then a count-only note.
Rehydration
After compaction
Strategically re-injects only what was lost: workspace instructions (AGENTS.md, max 3K chars), recently accessed files (max 5 files, 8K chars each), and a resume instruction for seamless continuation.
Objective Reinforcement
After compaction
For sub-agents, re-injects the original task objective so delegated tasks stay on track even through context compression.
Prompt caching
Cache-stable prompt architecture
Anthropic's prompt caching cuts costs 7.5x - but only if the system prompt stays identical across turns. Most frameworks embed timestamps or channel metadata in the system prompt, silently invalidating the cache on every message.
| | System Prompt | Dynamic Preamble |
|---|---|---|
| Content | Identity, personality, workspace files, tool definitions, security rules | Timestamp, sender metadata, channel context, RAG results, active skills, trust entries |
| Cache behavior | Cached - paid once at write rate, then $0.50/MTok on every subsequent call | Prepended to user message - never invalidates the cache prefix |
Six categories kept out of the system prompt
Date/time, inbound message metadata, channel context, RAG memory results, active skill content, and sender trust entries. Each would invalidate the entire cache prefix if left inline (CACHE-01 through CACHE-06).
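The split is easy to picture in code. A minimal sketch, assuming a hypothetical preamble format; the field names here are illustrative, not Comis's actual wire format.

```typescript
type Turn = { system: string; user: string };

// The static system prompt is byte-identical across turns; all volatile
// categories travel in a preamble prepended to the user message instead.
function buildTurn(
  staticSystem: string,
  userText: string,
  dynamic: { timestamp: string; channel: string; ragResults: string[] },
): Turn {
  const preamble = [
    `[time: ${dynamic.timestamp}]`,
    `[channel: ${dynamic.channel}]`,
    ...dynamic.ragResults.map(r => `[memory: ${r}]`),
  ].join("\n");
  return { system: staticSystem, user: `${preamble}\n\n${userText}` };
}
```

Because `system` never changes between calls, the cached prefix stays valid no matter how the dynamic fields vary.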
Cache fence system
Protecting the cached prefix across turns
Without protection, context engine layers modify messages within the cached region - stripping thinking blocks, masking tool results, evicting dead content. Each modification invalidates Anthropic's KV cache, forcing expensive re-writes. The cache fence prevents this.
Active cache optimization, not passive configuration
Most agent platforms treat caching as a provider feature to configure - a single cache_control tag on the system message. Comis treats the Anthropic cache as a target architecture to optimize for. The context engine pipeline is cache-aware: it knows where the cache boundary is and actively prevents modifications within the cached region.
breakpoint callback → fence index → layer guards → trim offset translation → back into the next call's breakpoint callback (state persisted across execute() calls)

Multiple feedback loops ensure cache stability across turns, achieving a 16.9x read/write ratio in production - meaning for every token written to cache, 17 tokens are served from it at 10x lower cost.
Cache fence feedback loop
Cache Fence Index
Tracks the highest cache breakpoint position from each LLM call. On the next turn, context engine layers skip all messages at or below this index - preventing modifications that would invalidate the cached prefix.
Eliminated 36K+ tokens of unnecessary cache invalidation per pipeline turn
Trim Offset Translation
The history window trims 100+ messages from the front of the session, shifting all indices. The fence index is stored in pre-trim space and correctly adjusted after trimming so protection survives across turns.
Fixed fence being zeroed out by 97-120 message trims
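The fence guard and the trim translation fit in a few lines. A sketch under the assumptions stated above (fence recorded in pre-trim coordinates, -1 meaning "no fence"); function names are illustrative.

```typescript
// The fence index is recorded before trimming. After the history window
// removes messages from the front, the index shifts by the trim count, or
// clears entirely if the fenced region was trimmed away.
function translateFence(fenceIndex: number, trimmedCount: number): number {
  if (fenceIndex < 0) return -1;            // no fence recorded yet
  const adjusted = fenceIndex - trimmedCount;
  return adjusted >= 0 ? adjusted : -1;     // fence fell off the front
}

// Context engine layers consult this before touching any message.
function isProtected(messageIndex: number, fenceIndex: number): boolean {
  return fenceIndex >= 0 && messageIndex <= fenceIndex;
}
```

Without the translation, a 100-message trim would leave the fence pointing at the wrong message, or zero it out entirely.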
Sub-Agent Spawn Staggering
Concurrent sub-agents in a pipeline wave are staggered by 4 seconds. The first sub-agent populates the shared cache prefix (system prompt + tools), then siblings read it instead of each paying the write cost.
49K avg cache reads on sub-agent first turn (shared prefix)
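Staggered spawning reduces to delaying each sibling by a multiple of the stagger interval. A minimal async sketch; the function name and task shape are hypothetical.

```typescript
// The first task starts immediately and pays the one-time cache write; each
// sibling is delayed so it finds the shared prefix already warm and reads it.
async function spawnStaggered<T>(
  tasks: Array<() => Promise<T>>,
  staggerMs = 4000,
): Promise<T[]> {
  const delay = (ms: number) => new Promise<void>(res => setTimeout(res, ms));
  return Promise.all(
    tasks.map(async (task, i) => {
      await delay(i * staggerMs); // agent 0: no delay; agent 1: 4s; agent 2: 8s...
      return task();
    }),
  );
}
```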
Adaptive Cold-Start Retention
Parent agents write at 1h TTL from the first call so the system prompt survives pipeline gaps (>5m). Sub-agents inherit mixed 5m/1h TTLs - shared prefix blocks use the first writer's TTL, conversation-specific content uses 5m. Cache refreshes don't upgrade TTL, so the initial write determines the cache lifetime.
Eliminates re-writes per pipeline gap. 81% cost reduction at steady state.
TTL Monotonicity Enforcement
Anthropic's API requires cache breakpoint TTLs to be non-increasing (system >= tools >= messages). The SDK sets system to 5m, but tool breakpoints escalate to 1h. Comis upgrades system block TTLs in the onPayload hook to satisfy the constraint, preventing silent downgrades.
Unlocked 1h cache writes for the first time. Without this, all 1h TTLs were silently downgraded to 5m.
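The monotonicity fix amounts to an upgrade pass over the cache breakpoints before the request is sent. A simplified sketch of an onPayload-style hook; the block shape here is an assumption, not the SDK's actual payload format.

```typescript
type Ttl = "5m" | "1h";
type CacheBlock = { kind: "system" | "tool" | "message"; ttl: Ttl };

// Anthropic requires breakpoint TTLs to be non-increasing from system through
// tools to messages. If any later breakpoint requests 1h, the system block is
// upgraded to 1h so the 1h write is accepted instead of silently downgraded.
function enforceTtlMonotonicity(blocks: CacheBlock[]): CacheBlock[] {
  const needsLongSystem = blocks.some(b => b.kind !== "system" && b.ttl === "1h");
  return blocks.map(b =>
    b.kind === "system" && needsLongSystem && b.ttl === "5m" ? { ...b, ttl: "1h" } : b,
  );
}
```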
Breakpoint Threshold Tuning
Sub-agents use 512-token minimum for breakpoint placement. Parent agents use 1,024 tokens (lowered from 4,096). Enables message-level cache breakpoints in conversational exchanges where individual messages are 500-2000 tokens.
Sub-agent cache ratio improved from 3.0x to 4.7x. Parent agent gained 1 message breakpoint per call (was 0).
Cache diagnostics
Cache break detection and attribution
When cache invalidation happens, knowing why is critical. Comis uses a two-phase detection system that records pre-call state and performs post-call analysis to attribute every cache break to its root cause.
Phase 1: Pre-call snapshot
Records SHA-256 hashes of system prompt, tool schemas, and cache_control metadata before the LLM call. These fingerprints establish the baseline for comparison.
Phase 2: Post-call analysis
After the API response, compares actual cache write tokens against expected values. Dual threshold: >5% relative AND >2K absolute tokens triggers attribution.
Attribution priority chain
When a cache break is detected, the system walks a priority chain - comparing the recorded system prompt, tool schema, and cache_control fingerprints in order - to identify the root cause.
Lazy content diffing: full serialization only runs when a break is actually detected, keeping hot-path overhead near zero.
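The cheap pre-check that gates all of this can be sketched directly from the thresholds quoted above. A hedged illustration; the function name and defaults-as-parameters shape are mine.

```typescript
// A break fires only when the observed cache write exceeds the expected write
// by more than 5% relative AND more than 2K absolute tokens. Attribution and
// the expensive content diff run only after this check passes.
function isCacheBreak(
  expectedWriteTokens: number,
  actualWriteTokens: number,
  relThreshold = 0.05,
  absThreshold = 2000,
): boolean {
  const excess = actualWriteTokens - expectedWriteTokens;
  if (excess <= 0) return false;
  const relative = expectedWriteTokens > 0 ? excess / expectedWriteTokens : Infinity;
  return relative > relThreshold && excess > absThreshold;
}
```

Requiring both thresholds means small sessions don't trip on percentage noise and large sessions don't trip on a few thousand incidental tokens.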
Multi-provider caching
Gemini explicit caching via CachedContent API
Beyond Anthropic's prompt cache, Comis implements explicit caching for Google's Gemini models using the CachedContent API - a guaranteed 90% discount on cached tokens.
Cache Manager
Full CachedContent lifecycle: create, reuse, refresh, and dispose. SHA-256 content hashing detects changes. Concurrent request deduplication prevents duplicate cache entries.
Per-model thresholds
Gemini Flash requires 1,024 minimum cacheable tokens. Gemini Pro requires 4,096. Below-minimum requests fall through to uncached calls automatically.
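The fall-through logic is a one-line threshold check. A sketch using the minimums from the text; the model IDs and function name are illustrative, not the Gemini API's actual identifiers.

```typescript
// Minimum cacheable prompt sizes per model family (values from the text).
const MIN_CACHEABLE_TOKENS: Record<string, number> = {
  "gemini-flash": 1024,
  "gemini-pro": 4096,
};

// Below the minimum (or for unknown models), skip CachedContent entirely
// and make a plain uncached call.
function shouldUseCachedContent(model: string, promptTokens: number): boolean {
  const min = MIN_CACHEABLE_TOKENS[model];
  return min !== undefined && promptTokens >= min;
}
```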
Lifecycle management
Session expiry events trigger cache disposal. Orphaned caches (from crashes or disconnects) are detected and cleaned up automatically.
Provider isolation
Anthropic's requestBodyInjector and Google's geminiCacheInjector are mutually exclusive at runtime via provider guards. The Gemini injector atomically strips inherited fields (systemInstruction, tools, toolConfig) when a CachedContent name is present, since these are already baked into the cache.
Pipeline caching
How 8 agents share a single cache prefix
In a multi-agent pipeline, each sub-agent shares the same system prompt and tool definitions. Without staggering, all agents would write the same content simultaneously - paying the write cost 8 times. With staggered spawning, the first agent writes once and the rest read at 10x lower cost.
Staggered sub-agent spawning (4s intervals)
Without staggering
$22.51
4M prompt tokens at input rate. No cache sharing. Each agent pays full write cost for identical system prompts.
With Comis cache system
$12.48
4.9M tokens from cache reads at $0.50/MTok. First agent writes, siblings read. Fence prevents invalidation. 45% cheaper on this run, rising to 81% at steady state.
Anthropic pricing
Why cache reads matter
Cache reads are 10-20x cheaper than writes and 10x cheaper than base input. Every token shifted from write to read directly reduces cost.
| Model | Input | 5m Write | 1h Write | Cache Read | Saving |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 10x cheaper |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 10x cheaper |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 10x cheaper |
Prices per million tokens. Source: Anthropic pricing page, June 2025.
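A worked example from the Opus row makes the economics concrete. This is a deliberately simplified blend - it prices every uncached token at the 5m write rate and ignores output tokens - so treat the result as an approximation, not Comis's accounting.

```typescript
// Opus 4.6 rates from the table above, in $ per MTok.
const INPUT = 5.0, WRITE_5M = 6.25, READ = 0.5;

// Blended prompt cost per MTok at a given cache hit rate: uncached tokens
// pay the 5m write rate, cached tokens pay the read rate.
function costPerMTok(cachedFraction: number): number {
  return (1 - cachedFraction) * WRITE_5M + cachedFraction * READ;
}
```

At the 94% cache hit rate quoted at the top of this page, the blend comes to about $0.85/MTok versus $5.00 at the base input rate - an ~83% reduction, in line with the 81% headline figure.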
Write-time optimization
Microcompaction
Tool results are intercepted at write time. Oversized results are offloaded to disk and replaced with lightweight references. The agent can re-read if needed.
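The write-time check can be sketched with the thresholds this section lists. The offload callback (persisting to disk and returning a reference id) is left abstract, and the placeholder format is hypothetical.

```typescript
// Per-tool character limits from this section, plus the hard cap.
const CHAR_LIMITS: Record<string, number> = {
  file_read: 15_000,
  mcp: 15_000,
  default: 8_000,
};
const HARD_CAP = 100_000;

// Oversized results are persisted via `offload` and replaced with a
// lightweight reference; the agent can re-read the full content on demand.
function microcompact(
  tool: string,
  result: string,
  offload: (content: string) => string,
): string {
  const limit = Math.min(CHAR_LIMITS[tool] ?? CHAR_LIMITS.default, HARD_CAP);
  if (result.length <= limit) return result;
  const ref = offload(result);
  return `[offloaded ${result.length} chars; re-read via ref ${ref}]`;
}
```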
| Result type | Offload threshold |
|---|---|
| file_read | 15,000 chars |
| MCP tools | 15,000 chars |
| Default tools | 8,000 chars |
| Hard cap | 100,000 chars |

Tool optimization
Lean tool definitions
Most frameworks send verbose tool descriptions on every turn - usage guides, action lists, workspace paths. Comis uses lean, structured descriptions optimized for tool selection, with detailed guidance delivered just-in-time on first use.
Lean contracts - every tool gets a structured description under 150 characters, optimized for selection accuracy.
Just-in-time guides - detailed usage instructions delivered in the tool response on first use, not loaded upfront.
Progressive disclosure
Three-tier tool context management
With 50+ built-in tools and unlimited MCP integrations, tool definitions can consume more context than the conversation itself. Comis uses a three-tier system that loads tool information progressively based on relevance.
Research shows fewer, better-described tools outperform large tool sets. Disambiguation between similar tools yields 8-38% accuracy gains.
Lean definitions
Every tool gets a short, structured description optimized for selection - not usage. Confusable tool pairs include explicit disambiguation suffixes so the model picks the right one.
Just-in-time guidance
Detailed usage instructions, workspace guides, and action requirements are injected into the tool response the first time a tool is called. Delivered exactly when needed, paid only once per session.
Deferred loading
Tools irrelevant to the current context - wrong platform, insufficient trust, rarely used - are deferred behind a semantic discovery tool. Admin tools defer for non-admin sessions. Discord tools defer on Telegram.
Model-aware tool presentation
Small models (8B-70B) get aggressive deferral, pruned parameter schemas, and a focused core tool set. Large models get full schemas with lean descriptions. The model tier is resolved automatically from the context window size - no configuration needed.
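A sketch of the tier decision. The 32K cutoff is purely illustrative - the text only says the tier is derived from the context window size - and the "pruned schema" is modeled as an empty object.

```typescript
type Tool = { name: string; core: boolean; schema: object };

// Small-tier models get only core tools with pruned parameter schemas;
// large-tier models get the full set with full schemas.
function presentTools(tools: Tool[], contextWindowTokens: number): Tool[] {
  const small = contextWindowTokens < 32_000; // illustrative cutoff
  return small
    ? tools.filter(t => t.core).map(t => ({ ...t, schema: {} }))
    : tools;
}
```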
Cost control
Three-tier budget guard
Every agent has three budget caps, checked before the LLM call - not after the money is spent.
200K tokens
Per-execution
Prevent runaway single calls
1M tokens
Per-hour
Rate limiting
5M tokens
Per-day
Daily cost ceiling
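The three caps above reduce to a single pre-call check. A sketch using the default values listed; in the real system the `usage` counters would come from rolling per-hour and per-day accounting.

```typescript
type Caps = { perExecution: number; perHour: number; perDay: number };

// Defaults from the cards above.
const DEFAULT_CAPS: Caps = { perExecution: 200_000, perHour: 1_000_000, perDay: 5_000_000 };

// Checked BEFORE the LLM call: if the estimate would breach any cap,
// the call is refused and no money is spent.
function checkBudget(
  estimatedTokens: number,
  usage: { hour: number; day: number },
  caps: Caps = DEFAULT_CAPS,
): { ok: boolean; reason?: string } {
  if (estimatedTokens > caps.perExecution) return { ok: false, reason: "per-execution cap" };
  if (usage.hour + estimatedTokens > caps.perHour) return { ok: false, reason: "hourly cap" };
  if (usage.day + estimatedTokens > caps.perDay) return { ok: false, reason: "daily cap" };
  return { ok: true };
}
```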
Production data
Real-world cost profile
From a production session running an 8-agent stock analysis pipeline with Claude Opus 4.6 (TradingAgents: 4 analysts, bull-bear debate, trader, risk management, portfolio manager).
| Operation | Cost | Notes |
|---|---|---|
| First "Hello" | $0.25 | Cold cache - full system prompt write (Opus) |
| Second message | $0.03 | Warm cache - fence protects prefix |
| Subsequent messages | $0.04-0.09 | Stable cache reads, fence active |
| 8-agent trading pipeline | ~$2.11 | 4 analysts + debate + trader + risk + PM (788K tokens, 70% cache effectiveness) |
| Pipeline without optimizations | ~$22.51 | Same pipeline, no caching |
Without context management
The same 8-agent pipeline without caching costs $22.51 - 4M prompt tokens at full input rate. With Comis cache system: $12.48. The cache fence alone prevented 36K+ tokens of unnecessary re-writes per pipeline turn by protecting the cached prefix from context engine modifications.
Observability
Every token accounted for
Every pipeline run logs structured metrics. Events fire on every significant action, feeding into the observability dashboard for real-time cost monitoring.
{
"tokensLoaded": 40112,
"tokensEvicted": 0,
"tokensMasked": 0,
"tokensCompacted": 0,
"thinkingBlocksRemoved": 0,
"budgetUtilization": 0.05,
"cacheFenceIndex": 54,
"sessionDepth": 188,
"sessionToolResults": 50,
"durationMs": 1
}

cacheFenceIndex: 54 means messages 0-54 are protected from modification. thinkingBlocksRemoved: 0 confirms the fence is preventing unnecessary stripping. Budget utilization at 5% with 188-message depth shows the history window and context layers are keeping costs controlled even in long sessions.
The full picture.
20 mechanisms working together to keep costs predictable at any scale.
| Mechanism | What it saves | When it fires |
|---|---|---|
| Thinking cleaner | Old reasoning traces | Every call (layer 1) |
| Reasoning tag stripper | Non-Anthropic inline reasoning tags | Every call (layer 2) |
| History window | Old conversation turns | Every call (layer 3) |
| Dead content evictor | Superseded file reads, exec results | Every call (layer 4) |
| Observation masker | Old tool outputs (3-tier: protected/standard/ephemeral) | Context > 120K chars (layer 5) |
| LLM compaction | Entire conversation → 9-section summary | Context > 85% of window (layer 6) |
| Rehydration | Re-injects critical context | After compaction (layer 7) |
| Objective reinforcement | Sub-agent goals survive compaction | After compaction (layer 8) |
| Cache fence | Prevents cache prefix invalidation | Every call (all layers) |
| Trim offset translation | Fence survives history-window | After layer 3 |
| Spawn staggering | Shared cache prefix across sub-agents | Pipeline wave start |
| Tiered cache retention | 5m for sub-agents, 1h for parent | Per-session |
| Cache-stable prompts | Dynamic fields out of system prompt | Every call after first |
| Cache break detection | Detects and attributes cache invalidation causes | Post-LLM call |
| Gemini explicit caching | CachedContent API with SHA-256 hashing | Gemini providers |
| Prefix instability detection | Forces short TTL on stuck cache reads | Per-session |
| Microcompaction | Large tool results offloaded | At write time |
| Re-read detector | Identifies duplicate tool calls | Every call |
| Lean tool definitions | 88% of tool description tokens | Every call |
| Just-in-time guidance | Upfront instructional overhead | On first tool use |
| Tool deferral | Irrelevant tool schemas | Context-dependent |
| Model-aware presentation | Schema overhead on small models | Auto by model tier |
| Budget guard | Prevents runaway calls | Pre-call estimation |
Real production data
Anthropic dashboard: April 11, 2026
Actual token usage and costs from a production Comis instance running multi-agent pipelines on Claude Opus 4.6. Not a benchmark - real user traffic.
Token breakdown
Cost breakdown (Opus 4.6)
16.9x
Read/write ratio
81%
Cost savings
$26.42
Without caching
Production agents with predictable costs.
Run agents indefinitely, across thousands of turns, without the exponential token growth that makes naive frameworks unusable at scale.