
How Comis cuts LLM costs by 81%

An 8-layer context engine, a cache fence system, sub-agent spawn staggering, progressive tool disclosure - 20 mechanisms in total - so your agents run indefinitely with predictable costs.

17x

Cache read/write ratio

94%

Tokens served from cache

93%

Writes are cold-start only

$0.50

Per MTok cached reads

The problem

Context windows are expensive.

System prompts are large

Identity, instructions, tool definitions, workspace files, and security guardrails easily reach 30-60K tokens.

Conversations grow fast

Tool results (file reads, web fetches, API responses) can be 10-50K tokens each. A 30-turn session with tool use can cost $5-15.

Multi-agent pipelines multiply costs

An 8-agent stock analysis pipeline costs 8x the base rate per message. Without optimization, this becomes prohibitive.

Cache misses are silent killers

Anthropic charges 10-20x more for cache writes ($6.25-10/MTok) than reads ($0.50/MTok). A single modified message in the cached prefix invalidates the entire KV cache.

The solution

8-layer context engine pipeline

Every conversation is processed through a composable pipeline of 8 layers before each LLM call. Each layer targets a specific source of token waste.

Every layer has a circuit breaker: three consecutive failures disable the layer. No single optimization bug can bring down the pipeline.

1

Thinking Block Cleaner

Every call

Strips extended thinking traces from older turns (default: 10 turns), keeping recent reasoning while reclaiming tokens from stale deliberation.

2

Reasoning Tag Stripper

Every call

Strips inline reasoning tags (<think>, <thinking>, <thought>, <antThinking>) from non-Anthropic provider responses persisted in session history. Always active regardless of the current model's reasoning capabilities, since sessions may contain messages from multiple providers.

3

History Window

Every call

Caps conversation history to the last N user turns (default: 15, configurable per channel). Pair-safe: never splits a tool-call/tool-result pair. Compaction summaries always preserved as anchors.
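Pair-safe windowing can be sketched as follows, assuming a simplified message shape with an `isCompactionSummary` flag (an illustrative stand-in for however Comis marks summaries):

```typescript
interface Msg {
  role: "user" | "assistant" | "tool";
  isCompactionSummary?: boolean;
}

function windowHistory(history: Msg[], maxUserTurns = 15): Msg[] {
  // Walk backwards counting user turns to find the cut point.
  let userTurns = 0;
  let cut = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i].role === "user") userTurns++;
    if (userTurns === maxUserTurns) { cut = i; break; }
  }
  // Pair safety: never start the window on a tool result whose
  // originating tool call would be trimmed away - widen instead.
  while (cut > 0 && history[cut].role === "tool") cut--;
  // Compaction summaries before the cut survive as anchors.
  const anchors = history.slice(0, cut).filter(m => m.isCompactionSummary);
  return [...anchors, ...history.slice(cut)];
}
```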

4

Dead Content Evictor

Every call

Uses forward-index O(n) analysis to detect provably superseded tool results. If an agent read a file at turn 5 and again at turn 20, the turn-5 result is replaced with a 50-byte placeholder. Tracks 5 categories: file_read, exec, web, image, error.
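The forward-index pass reduces to two O(n) sweeps. A sketch, with an illustrative `ToolResult` shape and placeholder text:

```typescript
interface ToolResult { tool: string; key: string; content: string; }

function evictSuperseded(results: ToolResult[]): ToolResult[] {
  // Forward index: last position at which each (tool, key) pair appears.
  const last = new Map<string, number>();
  results.forEach((r, i) => last.set(`${r.tool}:${r.key}`, i));
  // Any earlier occurrence is provably superseded by the later read,
  // so it collapses to a tiny placeholder.
  return results.map((r, i) =>
    last.get(`${r.tool}:${r.key}`)! > i
      ? { ...r, content: "[superseded by later read]" }
      : r
  );
}
```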

5

Observation Masker

Context > 120K chars

Three-tier masking when total context exceeds 120K characters. Protected tools (memory, file reads) are never masked. Standard tools use a keep window (default: 25 most recent). Ephemeral tools (web searches, fetches) get a shorter keep window (default: 10). Masked entries persist to disk for stable cache prefixes. Hysteresis prevents oscillation: activates at 120K, deactivates below 80K.
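The hysteresis band reduces to a two-threshold state machine. Thresholds match the text; the class name is illustrative:

```typescript
class MaskingGate {
  private active = false;

  constructor(private high = 120_000, private low = 80_000) {}

  shouldMask(contextChars: number): boolean {
    // Activate above the high-water mark, deactivate only below the
    // low-water mark - the 40K gap prevents mask/unmask oscillation.
    if (!this.active && contextChars > this.high) this.active = true;
    else if (this.active && contextChars < this.low) this.active = false;
    return this.active;
  }
}
```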

6

LLM Compaction

Context > 85% of window

Last-resort compression when context exceeds 85% of the model window. Compresses 50+ messages into a structured 9-section summary using a cheaper model (Haiku by default). Three-tier fallback: full summarization, filtered summarization, count-only note.

7

Rehydration

After compaction

After compaction, strategically re-injects only what was lost: workspace instructions (AGENTS.md, max 3K chars), recently-accessed files (max 5 files, 8K each), and a resume instruction for seamless continuation.

8

Objective Reinforcement

After compaction

For sub-agents, re-injects the original task objective after compaction so delegated tasks stay on track even through context compression.

Prompt caching

Cache-stable prompt architecture

Anthropic's prompt caching cuts costs 7.5x - but only if the system prompt stays identical across turns. Most frameworks embed timestamps or channel metadata in the system prompt, silently invalidating the cache on every message.

Static

System Prompt

Content

Identity, personality, workspace files, tool definitions, security rules

Cache behavior

Cached - paid once at write rate, then $0.50/MTok on every subsequent call

Per-turn

Dynamic Preamble

Content

Timestamp, sender metadata, channel context, RAG results, active skills, trust entries

Cache behavior

Prepended to user message - never invalidates the cache prefix

Six categories kept out of the system prompt

Date/time, inbound message metadata, channel context, RAG memory results, active skill content, and sender trust entries. Each would invalidate the entire cache prefix if left inline (CACHE-01 through CACHE-06).
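The split can be sketched with illustrative field names: the system prompt is a constant, while every volatile field rides inside the user message where it cannot touch the cached prefix:

```typescript
interface Turn {
  timestamp: string;
  sender: string;
  channel: string;
  text: string;
}

// Identity, tools, rules: byte-identical on every call, so the cache holds.
const SYSTEM_PROMPT = "You are ...";

function buildUserMessage(turn: Turn): string {
  // Timestamp, sender, and channel context are prepended to the user
  // message instead of the system prompt - volatile but cache-safe.
  const preamble =
    `[time: ${turn.timestamp}] [from: ${turn.sender}] [channel: ${turn.channel}]`;
  return `${preamble}\n${turn.text}`;
}
```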

Cache fence system

Protecting the cached prefix across turns

Without protection, context engine layers modify messages within the cached region - stripping thinking blocks, masking tool results, evicting dead content. Each modification invalidates Anthropic's KV cache, forcing expensive re-writes. The cache fence prevents this.

Active cache optimization, not passive configuration

Most agent platforms treat caching as a provider feature to configure - a single cache_control tag on the system message. Comis treats the Anthropic cache as a target architecture to optimize for. The context engine pipeline is cache-aware: it knows where the cache boundary is and actively prevents modifications within the cached region.

breakpoint callback → fence index → layer guards → trim offset translation
         ↑                                                    ↓
         ←———— persisted across execute() calls ————→

Multiple feedback loops ensure cache stability across turns, achieving a 16.9x read/write ratio in production - meaning for every token written to cache, 17 tokens are served from it at 10x lower cost.

Cache fence feedback loop

[Diagram] Turn N: messages 0..52 sit behind the fence; everything after is modifiable. The context engine (8 layers + fence guard) emits 64 messages after trimming; the LLM call places cache breakpoints and fires onBreakpointsPlaced(idx=52). Turn N+1: the fence is seeded in pre-trim space as 52 + trimOffset(116) = 168, protecting messages 0..168 of the session (which grew by 4). The history window trims 118 messages from the front, so the fence is adjusted to max(-1, 168-118) = 50. Layers skip messages 0..50, preserving the cache. The fence index is persisted via a Map across execute() calls.
CACHE-20

Cache Fence Index

Tracks the highest cache breakpoint position from each LLM call. On the next turn, context engine layers skip all messages at or below this index - preventing modifications that would invalidate the cached prefix.

Eliminated 36K+ tokens of unnecessary cache invalidation per pipeline turn

CACHE-20

Trim Offset Translation

The history window trims 100+ messages from the front of the session, shifting all indices. The fence index is stored in pre-trim space and correctly adjusted after trimming so protection survives across turns.

Fixed fence being zeroed out by 97-120 message trims
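The fence arithmetic from the feedback-loop diagram can be written directly (function names are illustrative, not the actual Comis internals):

```typescript
// Express the breakpoint index in pre-trim session space by adding back
// the offset that earlier trims removed.
function seedFence(breakpointIdx: number, priorTrimOffset: number): number {
  return breakpointIdx + priorTrimOffset;
}

// After the history window trims N messages from the front, shift the
// fence down. -1 means "no protected prefix survives this trim".
function adjustFence(fence: number, trimmedFromFront: number): number {
  return Math.max(-1, fence - trimmedFromFront);
}
```

With the diagram's numbers: a breakpoint at index 52 plus a trim offset of 116 seeds the fence at 168; trimming 118 messages adjusts it to 50.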

CACHE-21

Sub-Agent Spawn Staggering

Concurrent sub-agents in a pipeline wave are staggered by 4 seconds. The first sub-agent populates the shared cache prefix (system prompt + tools), then siblings read it instead of each paying the write cost.

49K avg cache reads on sub-agent first turn (shared prefix)
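The stagger schedule is trivial to state (interval per the text; the helper name is an assumption):

```typescript
// Agent 0 spawns immediately and pays the one-time cache write for the
// shared prefix; each later sibling spawns one interval later and reads
// the now-warm prefix instead of re-writing it.
function spawnDelaysMs(agentCount: number, intervalMs = 4_000): number[] {
  return Array.from({ length: agentCount }, (_, i) => i * intervalMs);
}
```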

CACHE-23

Adaptive Cold-Start Retention

Parent agents write at 1h TTL from the first call so the system prompt survives pipeline gaps (>5m). Sub-agents inherit mixed 5m/1h TTLs - shared prefix blocks use the first writer's TTL, conversation-specific content uses 5m. Cache refreshes don't upgrade TTL, so the initial write determines the cache lifetime.

Eliminates re-writes per pipeline gap. 81% cost reduction at steady state.

CACHE-23

TTL Monotonicity Enforcement

Anthropic's API requires cache breakpoint TTLs to be non-increasing (system >= tools >= messages). The SDK sets system to 5m, but tool breakpoints escalate to 1h. Comis upgrades system block TTLs in the onPayload hook to satisfy the constraint, preventing silent downgrades.

Unlocked 1h cache writes for the first time. Without this, all 1h TTLs were silently downgraded to 5m.
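The upgrade rule can be sketched over a simplified payload shape (not the real SDK types):

```typescript
type Ttl = "5m" | "1h";

interface Block {
  cache_control?: { type: "ephemeral"; ttl: Ttl };
}

// If any tool breakpoint escalated to 1h, upgrade 5m system blocks so
// the required ordering (system >= tools >= messages) holds and the
// 1h writes are not silently downgraded.
function enforceMonotonicTtl(system: Block[], tools: Block[]): void {
  const anyHourTool = tools.some(b => b.cache_control?.ttl === "1h");
  if (!anyHourTool) return;
  for (const b of system) {
    if (b.cache_control?.ttl === "5m") b.cache_control.ttl = "1h";
  }
}
```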

CACHE-20/23

Breakpoint Threshold Tuning

Sub-agents use 512-token minimum for breakpoint placement. Parent agents use 1,024 tokens (lowered from 4,096). Enables message-level cache breakpoints in conversational exchanges where individual messages are 500-2000 tokens.

Sub-agent cache ratio improved from 3.0x to 4.7x. Parent agent gained 1 message breakpoint per call (was 0).

Cache diagnostics

Cache break detection and attribution

When cache invalidation happens, knowing why is critical. Comis uses a two-phase detection system that records pre-call state and performs post-call analysis to attribute every cache break to its root cause.

Phase 1: Pre-call snapshot

Records SHA-256 hashes of system prompt, tool schemas, and cache_control metadata before the LLM call. These fingerprints establish the baseline for comparison.

Phase 2: Post-call analysis

After the API response, compares actual cache write tokens against expected values. Dual threshold: >5% relative AND >2K absolute tokens triggers attribution.
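The dual threshold reduces to a single predicate - both conditions must hold before attribution runs. The 5% and 2K figures are from the text; the function name is illustrative:

```typescript
// A break is flagged only when the excess write is both large in
// absolute terms (>2K tokens) and large relative to the expected
// write (>5%) - either alone is treated as noise.
function isCacheBreak(expectedWrite: number, actualWrite: number): boolean {
  const excess = actualWrite - expectedWrite;
  return excess > 2_000 && excess / Math.max(expectedWrite, 1) > 0.05;
}
```

The AND keeps false positives down at both extremes: small prompts trip the relative test easily, huge prompts trip the absolute test easily, and only a genuine break trips both.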

Attribution priority chain

When a cache break is detected, the system walks a priority chain to identify the cause:

model change → system prompt → tool schemas → retention change → metadata → TTL expiry → server eviction

Lazy content diffing: full serialization only runs when a break is actually detected, keeping hot-path overhead near zero.

Multi-provider caching

Gemini explicit caching via CachedContent API

Beyond Anthropic's prompt cache, Comis implements explicit caching for Google's Gemini models using the CachedContent API - a guaranteed 90% discount on cached tokens.

Cache Manager

Full CachedContent lifecycle: create, reuse, refresh, and dispose. SHA-256 content hashing detects changes. Concurrent request deduplication prevents duplicate cache entries.

Per-model thresholds

Gemini Flash requires 1,024 minimum cacheable tokens. Gemini Pro requires 4,096. Below-minimum requests fall through to uncached calls automatically.
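The threshold check can be sketched as below. Matching on the model name is an assumption for illustration; the real resolution logic may differ:

```typescript
// Per-model minimum cacheable prompt size, per the thresholds above.
function minCacheableTokens(model: string): number {
  return model.includes("pro") ? 4_096 : 1_024; // Flash-class default
}

// Below-minimum requests fall through to a normal uncached call.
function shouldUseExplicitCache(model: string, promptTokens: number): boolean {
  return promptTokens >= minCacheableTokens(model);
}
```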

Lifecycle management

Session expiry events trigger cache disposal. Orphaned caches (from crashes or disconnects) are detected and cleaned up automatically.

Provider isolation

Anthropic's requestBodyInjector and Google's geminiCacheInjector are mutually exclusive at runtime via provider guards. The Gemini injector atomically strips inherited fields (systemInstruction, tools, toolConfig) when a CachedContent name is present, since these are already baked into the cache.

Pipeline caching

How 8 agents share a single cache prefix

In a multi-agent pipeline, each sub-agent shares the same system prompt and tool definitions. Without staggering, all agents would write the same content simultaneously - paying the write cost 8 times. With staggered spawning, the first agent writes once and the rest read at 10x lower cost.

Staggered sub-agent spawning (4s intervals)

[Timeline] Agent 1 spawns at 0s and pays the cache WRITE ($6.25-10/MTok); Agents 2-4 spawn at 4-second intervals and READ the shared prefix ($0.50/MTok).

Without staggering

$22.51

4M prompt tokens at input rate. No cache sharing. Each agent pays full write cost for identical system prompts.

With Comis cache system

$12.48

4.9M tokens from cache reads at $0.50/MTok. First agent writes, siblings read. Fence prevents invalidation. A 45% cost reduction on this pipeline run.

Anthropic pricing

Why cache reads matter

Cache reads are 10-20x cheaper than writes and 10x cheaper than base input. Every token shifted from write to read directly reduces cost.

Model | Input | 5m Write | 1h Write | Cache Read | Saving
Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 10x cheaper
Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 10x cheaper
Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 10x cheaper

Prices per million tokens. Source: Anthropic pricing page, June 2025.

Write-time optimization

Microcompaction

Tool results are intercepted at write time. Oversized results are offloaded to disk and replaced with lightweight references. The agent can re-read if needed.

file_read 15,000 chars
MCP tools 15,000 chars
Default tools 8,000 chars
Hard cap 100,000 chars
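Write-time offloading can be sketched as follows. The limits mirror the table above; the reference format and the `offload` store are illustrative assumptions:

```typescript
// Per-tool size limits (chars) before a result is offloaded to disk.
const LIMITS: Record<string, number> = {
  file_read: 15_000,
  mcp: 15_000,
  default: 8_000,
};
const HARD_CAP = 100_000;

function microcompact(
  tool: string,
  content: string,
  offload: (content: string) => string // persists content, returns a reference
): string {
  const limit = Math.min(LIMITS[tool] ?? LIMITS.default, HARD_CAP);
  if (content.length <= limit) return content;
  // Oversized result: keep only a lightweight pointer in the transcript.
  const ref = offload(content);
  return `[result offloaded: ${content.length} chars, re-read via ${ref}]`;
}
```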

Tool optimization

Lean tool definitions

Most frameworks send verbose tool descriptions on every turn - usage guides, action lists, workspace paths. Comis uses lean, structured descriptions optimized for tool selection, with detailed guidance delivered just-in-time on first use.

Lean contracts - every tool gets a structured description under 150 characters, optimized for selection accuracy.

Just-in-time guides - detailed usage instructions delivered in the tool response on first use, not loaded upfront.

Progressive disclosure

Three-tier tool context management

With 50+ built-in tools and unlimited MCP integrations, tool definitions can consume more context than the conversation itself. Comis uses a three-tier system that loads tool information progressively based on relevance.

Research shows fewer, better-described tools outperform large tool sets. Disambiguation between similar tools yields 8-38% accuracy gains.

Tier 1 Always present

Lean definitions

Every tool gets a short, structured description optimized for selection - not usage. Confusable tool pairs include explicit disambiguation suffixes so the model picks the right one.

~20 tokens/tool vs ~240 typical
Tier 2 On first use

Just-in-time guidance

Detailed usage instructions, workspace guides, and action requirements are injected into the tool response the first time a tool is called. Delivered exactly when needed, paid only once per session.

~0 tokens upfront, loaded on demand
Tier 3 Context-dependent

Deferred loading

Tools irrelevant to the current context - wrong platform, insufficient trust, rarely used - are deferred behind a semantic discovery tool. Admin tools defer for non-admin sessions. Discord tools defer on Telegram.

0 tokens until discovered

Model-aware tool presentation

Small models (8B-70B) get aggressive deferral, pruned parameter schemas, and a focused core tool set. Large models get full schemas with lean descriptions. The model tier is resolved automatically from the context window size - no configuration needed.

Cost control

Three-tier budget guard

Every agent has three budget caps, checked before the LLM call - not after the money is spent.

200K tokens

Per-execution

Prevent runaway single calls

1M tokens

Per-hour

Rate limiting

5M tokens

Per-day

Daily cost ceiling
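The pre-call check reduces to three comparisons. The caps match the figures above; the usage-tracking shape is an assumption:

```typescript
interface Usage { hour: number; day: number; }

// Per-execution, per-hour, per-day caps from the three tiers above.
const CAPS = { execution: 200_000, hour: 1_000_000, day: 5_000_000 };

// Checked before the LLM call - never after the money is spent.
function allowCall(used: Usage, estimatedTokens: number): boolean {
  return (
    estimatedTokens <= CAPS.execution &&
    used.hour + estimatedTokens <= CAPS.hour &&
    used.day + estimatedTokens <= CAPS.day
  );
}
```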

Production data

Real-world cost profile

From a production session running an 8-agent stock analysis pipeline with Claude Opus 4.6 (TradingAgents: 4 analysts, bull-bear debate, trader, risk management, portfolio manager).

Operation | Cost | Notes
First "Hello" | $0.25 | Cold cache - full system prompt write (Opus)
Second message | $0.03 | Warm cache - fence protects prefix
Subsequent messages | $0.04-0.09 | Stable cache reads, fence active
8-agent trading pipeline | ~$2.11 | 4 analysts + debate + trader + risk + PM (788K tokens, 70% cache effectiveness)
Pipeline without optimizations | ~$22.51 | Same pipeline, no caching

Without context management

The same 8-agent pipeline without caching costs $22.51 - 4M prompt tokens at full input rate. With Comis cache system: $12.48. The cache fence alone prevented 36K+ tokens of unnecessary re-writes per pipeline turn by protecting the cached prefix from context engine modifications.

Observability

Every token accounted for

Every pipeline run logs structured metrics. Events fire on every significant action, feeding into the observability dashboard for real-time cost monitoring.

{
  "tokensLoaded": 40112,
  "tokensEvicted": 0,
  "tokensMasked": 0,
  "tokensCompacted": 0,
  "thinkingBlocksRemoved": 0,
  "budgetUtilization": 0.05,
  "cacheFenceIndex": 54,
  "sessionDepth": 188,
  "sessionToolResults": 50,
  "durationMs": 1
}

cacheFenceIndex: 54 means messages 0-54 are protected from modification. thinkingBlocksRemoved: 0 confirms the fence is preventing unnecessary stripping. Budget utilization at 5% with 188-message depth shows the history window and context layers are keeping costs controlled even in long sessions.

The full picture.

20 mechanisms working together to keep costs predictable at any scale.

Mechanism | What it saves | When it fires
Thinking cleaner | Old reasoning traces | Every call (layer 1)
Reasoning tag stripper | Non-Anthropic inline reasoning tags | Every call (layer 2)
History window | Old conversation turns | Every call (layer 3)
Dead content evictor | Superseded file reads, exec results | Every call (layer 4)
Observation masker | Old tool outputs (3-tier: protected/standard/ephemeral) | Context > 120K chars (layer 5)
LLM compaction | Entire conversation → 9-section summary | Context > 85% of window (layer 6)
Rehydration | Re-injects critical context | After compaction (layer 7)
Objective reinforcement | Sub-agent goals survive compaction | After compaction (layer 8)
Cache fence | Prevents cache prefix invalidation | Every call (all layers)
Trim offset translation | Fence survives history-window trims | After layer 3
Spawn staggering | Shared cache prefix across sub-agents | Pipeline wave start
Tiered cache retention | 5m for sub-agents, 1h for parent | Per-session
Cache-stable prompts | Dynamic fields out of system prompt | Every call after first
Cache break detection | Detects and attributes cache invalidation causes | Post-LLM call
Gemini explicit caching | CachedContent API with SHA-256 hashing | Gemini providers
Prefix instability detection | Forces short TTL on stuck cache reads | Per-session
Microcompaction | Large tool results offloaded | At write time
Re-read detector | Identifies duplicate tool calls | Every call
Lean tool definitions | 88% of tool description tokens | Every call
Just-in-time guidance | Upfront instructional overhead | On first tool use
Tool deferral | Irrelevant tool schemas | Context-dependent
Model-aware presentation | Schema overhead on small models | Auto by model tier
Budget guard | Prevents runaway calls | Pre-call estimation

Real production data

Anthropic dashboard: April 11, 2026

Actual token usage and costs from a production Comis instance running multi-agent pipelines on Claude Opus 4.6. Not a benchmark - real user traffic.

Token breakdown

Input | 62
Output | 20,662
Cache write (1h) | 289,900
Cache read | 4,892,893
Total | 5,213,517
Split: 94% cache read, 6% cache write

Cost breakdown (Opus 4.6)

Input ($5/MTok) | $0.00
Output ($25/MTok) | $0.52
Cache write (mixed 5m/1h TTL) | $2.06
Cache read ($0.50/MTok) | $2.45
Total | $5.02

16.9x

Read/write ratio

81%

Cost savings

$26.42

Without caching

Production agents with predictable costs.

Run agents indefinitely, across thousands of turns, without the exponential token growth that makes naive frameworks unusable at scale.