Cost Optimization

Comprehensive guide for optimizing LLM costs, configuring performance tiers, and implementing sensible defaults across the AgentOS framework.


Table of Contents

  1. Overview
  2. Cost Factors
  3. Optimization Strategies
  4. Performance Tiers
  5. Model Selection
  6. RAG Cost Optimization
  7. Storage Cost Optimization
  8. Configuration Reference
  9. Monitoring & Budgets
  10. Best Practices
  11. Summary

Overview

AgentOS is designed to be cost-conscious by default while allowing fine-grained control for users who need it. This guide covers:

  • LLM Costs: Token usage, model selection, caching
  • RAG Costs: Embedding generation, vector storage, retrieval
  • Storage Costs: Database operations, sync bandwidth
  • Compute Costs: Tool execution, streaming overhead

Key Principles

  1. Sensible Defaults: Out-of-box configuration minimizes cost while maintaining quality
  2. Configurable Tradeoffs: Choose between speed, cost, and accuracy
  3. Transparency: Built-in metrics for cost tracking
  4. Graceful Degradation: Falls back to cheaper options when possible

Cost Factors

LLM Token Costs (Approximate)

| Model | Input (per 1K) | Output (per 1K) | Context Window |
|---|---|---|---|
| GPT-4o | $0.005 | $0.015 | 128K |
| GPT-4o-mini | $0.00015 | $0.0006 | 128K |
| Claude 3.5 Sonnet | $0.003 | $0.015 | 200K |
| Claude 3 Haiku | $0.00025 | $0.00125 | 200K |
| Gemini 1.5 Pro | $0.00125 | $0.005 | 1M |
| Gemini 1.5 Flash | $0.000075 | $0.0003 | 1M |
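
To see how these rates translate into per-turn spend, the hypothetical helper below (not part of AgentOS) computes a turn's cost from its token counts using the prices in the table:

// Hypothetical helper: estimate the dollar cost of one turn from token counts.
// Prices are per 1K tokens, matching the table above (approximate, subject to change).
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.005, output: 0.015 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
};

function estimateTurnCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No price data for ${model}`);
  return (inputTokens / 1000) * p.input + (outputTokens / 1000) * p.output;
}

// Example: a 1,500-token prompt with a 500-token reply
// gpt-4o:      1.5 * $0.005   + 0.5 * $0.015  = $0.015
// gpt-4o-mini: 1.5 * $0.00015 + 0.5 * $0.0006 = $0.000525 (roughly 30x cheaper)
console.log(estimateTurnCost('gpt-4o', 1500, 500));      // 0.015
console.log(estimateTurnCost('gpt-4o-mini', 1500, 500)); // 0.000525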

Embedding Costs

| Model | Cost (per 1K tokens) | Dimensions |
|---|---|---|
| text-embedding-3-small | $0.00002 | 1536 |
| text-embedding-3-large | $0.00013 | 3072 |
| text-embedding-ada-002 | $0.0001 | 1536 |
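
For scale: embedding a 1M-token corpus costs roughly 1,000 × $0.00002 ≈ $0.02 with text-embedding-3-small, versus about $0.13 with text-embedding-3-large and $0.10 with ada-002, which is why the RAG examples later in this guide default to the small model.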

Storage Costs

| Service | Cost | Notes |
|---|---|---|
| Local SQLite | Free | Included in app |
| Supabase (Free) | Free | 500MB, 2 projects |
| Supabase (Pro) | $25/mo | 8GB included |
| Pinecone (Starter) | Free | 100K vectors |
| Pinecone (Standard) | $70/mo | 1M vectors |

Optimization Strategies

1. Model Tiering

Route different tasks to appropriate models:

import { AgentOS, AgentOSConfig } from '@framers/agentos';

const config: AgentOSConfig = {
  modelRouting: {
    // Use cheap models for simple tasks
    simple: {
      model: 'gpt-4o-mini',
      maxTokens: 500,
    },
    // Use powerful models for complex reasoning
    complex: {
      model: 'gpt-4o',
      maxTokens: 2000,
    },
    // Use very cheap for classification/routing
    routing: {
      model: 'gpt-4o-mini',
      maxTokens: 100,
    },
  },

  // Auto-route based on task complexity
  autoRouting: {
    enabled: true,
    complexityThreshold: 0.7, // 0-1 scale
    fallbackModel: 'gpt-4o-mini',
  },
};

const agentos = new AgentOS();
await agentos.initialize(config);

2. Context Window Management

Minimize token usage through smart context management:

const config: AgentOSConfig = {
  contextManagement: {
    // Max tokens for conversation history
    maxHistoryTokens: 4000,

    // Max tokens for RAG context
    maxRAGContextTokens: 2000,

    // Summarization strategy for long conversations
    summarizationStrategy: 'progressive', // 'none' | 'progressive' | 'aggressive'

    // When to summarize (percentage of max tokens)
    summarizeThreshold: 0.8,

    // Use cheaper model for summarization
    summarizationModel: 'gpt-4o-mini',
  },
};

3. Response Caching

Cache common responses to avoid repeated LLM calls:

const config: AgentOSConfig = {
  caching: {
    enabled: true,

    // Cache identical prompts
    promptCache: {
      enabled: true,
      ttlSeconds: 3600, // 1 hour
      maxEntries: 1000,
    },

    // Cache semantic similarity (fuzzy matching)
    semanticCache: {
      enabled: true,
      similarityThreshold: 0.95, // Very high similarity required
      ttlSeconds: 7200, // 2 hours
    },

    // Cache tool results
    toolResultCache: {
      enabled: true,
      ttlSeconds: 300, // 5 minutes
    },
  },
};
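
For intuition about what the semantic cache does, here is a minimal sketch, assuming entries are compared by cosine similarity of prompt embeddings; the types and function names are illustrative, not AgentOS internals:

// Minimal sketch of a semantic cache lookup: reuse a cached response only if a
// previously embedded prompt is similar enough and has not expired.
interface CacheEntry { embedding: number[]; response: string; expiresAt: number; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function lookup(promptEmbedding: number[], entries: CacheEntry[], threshold = 0.95): string | null {
  const now = Date.now();
  let best: { score: number; response: string } | null = null;
  for (const e of entries) {
    if (e.expiresAt < now) continue; // respect ttlSeconds
    const score = cosine(promptEmbedding, e.embedding);
    if (score >= threshold && (!best || score > best.score)) {
      best = { score, response: e.response };
    }
  }
  return best ? best.response : null; // null = cache miss, call the LLM
}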

4. Streaming Optimization

Optimize streaming for cost and latency:

const config: AgentOSConfig = {
  streaming: {
    // Enable streaming (better UX, same cost)
    enabled: true,

    // Batch small chunks (reduce overhead)
    batchingEnabled: true,
    batchIntervalMs: 50,

    // Auto-stop on user interruption (save tokens)
    interruptionHandling: 'stop', // 'stop' | 'complete' | 'summarize'
  },
};

5. Tool Execution Optimization

Minimize tool call overhead:

const config: AgentOSConfig = {
  toolExecution: {
    // Parallel tool execution (faster, same cost)
    parallelExecution: true,
    maxConcurrent: 5,

    // Cache tool results
    cacheResults: true,
    cacheTtlSeconds: 300,

    // Limit tool iterations (prevent runaway costs)
    maxIterations: 10,

    // Timeout for individual tools
    timeoutMs: 30000,
  },
};

Performance Tiers

Tier Configuration

type PerformanceTier = 'economy' | 'balanced' | 'performance' | 'custom';

const TIER_DEFAULTS = {
  economy: {
    defaultModel: 'gpt-4o-mini',
    maxTokensPerTurn: 500,
    cachingEnabled: true,
    summarizationEnabled: true,
    toolParallelization: false,
    ragEnabled: false,
  },
  balanced: {
    defaultModel: 'gpt-4o-mini',
    maxTokensPerTurn: 1000,
    cachingEnabled: true,
    summarizationEnabled: true,
    toolParallelization: true,
    ragEnabled: true,
  },
  performance: {
    defaultModel: 'gpt-4o',
    maxTokensPerTurn: 4000,
    cachingEnabled: false,
    summarizationEnabled: false,
    toolParallelization: true,
    ragEnabled: true,
  },
};

Usage

import { AgentOS } from '@framers/agentos';

// Economy tier: Minimize costs
const economyAgent = new AgentOS();
await economyAgent.initialize({
  performanceTier: 'economy',
});

// Balanced tier: Default, good for most use cases
const balancedAgent = new AgentOS();
await balancedAgent.initialize({
  performanceTier: 'balanced',
});

// Performance tier: Maximum capability
const performanceAgent = new AgentOS();
await performanceAgent.initialize({
  performanceTier: 'performance',
});

// Custom tier: Full control
const customAgent = new AgentOS();
await customAgent.initialize({
  performanceTier: 'custom',
  defaultModel: 'claude-3-haiku',
  maxTokensPerTurn: 2000,
  cachingEnabled: true,
  // ... other options
});

Tier Comparison

| Feature | Economy | Balanced | Performance |
|---|---|---|---|
| Default Model | gpt-4o-mini | gpt-4o-mini | gpt-4o |
| Max Tokens | 500 | 1000 | 4000 |
| Caching | Yes | Yes | No |
| RAG | No | Yes | Yes |
| Tool Parallel | No | Yes | Yes |
| Est. Cost/1K turns | ~$0.10 | ~$0.50 | ~$5.00 |

Model Selection

Automatic Model Selection

AgentOS can automatically select models based on task complexity:

const config: AgentOSConfig = {
  modelSelection: {
    strategy: 'auto', // 'fixed' | 'auto' | 'user-preference'

    // Complexity detection
    complexityDetection: {
      // Use fast classifier to estimate complexity
      classifier: 'rule-based', // 'rule-based' | 'ml' | 'llm'

      // Factors considered
      factors: [
        'messageLength', // Longer = more complex
        'technicalTerms', // Technical vocabulary
        'questionType', // 'factual' vs 'analytical'
        'toolRequirements', // Tools needed
        'contextDependency', // Needs history?
      ],
    },

    // Model mapping by complexity
    complexityMapping: {
      low: 'gpt-4o-mini', // 0.0 - 0.3
      medium: 'gpt-4o-mini', // 0.3 - 0.7
      high: 'gpt-4o', // 0.7 - 1.0
    },
  },
};
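
The 'rule-based' classifier is cheap because it never calls a model. As a rough sketch of how the listed factors could be folded into a 0-1 score and mapped through complexityMapping (the weights below are assumptions, not the values AgentOS uses):

// Illustrative rule-based complexity score in [0, 1]; weights are assumptions.
interface ComplexitySignals {
  messageLength: number;   // characters in the user message
  technicalTerms: number;  // count of technical terms detected
  isAnalytical: boolean;   // 'analytical' vs 'factual' question type
  toolsRequired: number;   // number of tools the request appears to need
  needsHistory: boolean;   // depends on earlier conversation context
}

function complexityScore(s: ComplexitySignals): number {
  let score = 0;
  score += Math.min(s.messageLength / 2000, 1) * 0.25;
  score += Math.min(s.technicalTerms / 10, 1) * 0.25;
  score += (s.isAnalytical ? 1 : 0) * 0.2;
  score += Math.min(s.toolsRequired / 3, 1) * 0.2;
  score += (s.needsHistory ? 1 : 0) * 0.1;
  return score;
}

function pickModel(score: number): string {
  if (score < 0.3) return 'gpt-4o-mini'; // low
  if (score < 0.7) return 'gpt-4o-mini'; // medium
  return 'gpt-4o';                       // high
}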

Provider Fallback

Configure fallback providers for reliability and cost control:

const config: AgentOSConfig = {
  providers: {
    primary: {
      name: 'openai',
      models: ['gpt-4o', 'gpt-4o-mini'],
      apiKey: process.env.OPENAI_API_KEY,
    },
    fallback: [
      {
        name: 'anthropic',
        models: ['claude-3-haiku'],
        apiKey: process.env.ANTHROPIC_API_KEY,
        // Use when primary fails or is expensive
        conditions: {
          onPrimaryFailure: true,
          onPrimaryOverBudget: true,
        },
      },
    ],
  },
};

RAG Cost Optimization

Embedding Strategy

const config: AgentOSConfig = {
  rag: {
    embedding: {
      // Use smaller, cheaper embeddings
      model: 'text-embedding-3-small', // vs 'large'
      dimensions: 1536,

      // Batch embeddings (reduce API calls)
      batchSize: 100,

      // Cache embeddings (don't re-embed same content)
      cacheEnabled: true,
      cacheTtlDays: 30,

      // Content deduplication (skip identical content)
      deduplication: true,
    },
  },
};

Retrieval Optimization

const config: AgentOSConfig = {
  rag: {
    retrieval: {
      // Limit retrieved documents
      topK: 5, // Don't retrieve more than needed

      // Minimum relevance threshold
      minScore: 0.7, // Skip low-relevance results

      // Hybrid search (lexical + semantic)
      hybridSearch: {
        enabled: true,
        // Lexical is free, semantic costs embeddings
        lexicalWeight: 0.3,
        semanticWeight: 0.7,
      },

      // Progressive retrieval (start small, expand if needed)
      progressive: {
        enabled: true,
        initialK: 3,
        maxK: 10,
        expansionThreshold: 0.5, // Expand if avg score < threshold
      },
    },
  },
};
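
Progressive retrieval pays for a second, larger query only when the first pass looks weak. A minimal sketch of that rule, assuming a generic search(query, k) function that returns results sorted by score (names here are illustrative, not the AgentOS implementation):

// Sketch of progressive retrieval: start with a small k, expand only when the
// initial results look weak, then drop low-relevance hits either way.
interface Hit { text: string; score: number; }

async function progressiveRetrieve(
  search: (query: string, k: number) => Promise<Hit[]>,
  query: string,
  opts = { initialK: 3, maxK: 10, expansionThreshold: 0.5, minScore: 0.7 },
): Promise<Hit[]> {
  let hits = await search(query, opts.initialK);
  const avg = hits.reduce((sum, h) => sum + h.score, 0) / Math.max(hits.length, 1);

  // Expand only if the average score of the first pass is below the threshold.
  if (avg < opts.expansionThreshold) {
    hits = await search(query, opts.maxK);
  }

  return hits.filter((h) => h.score >= opts.minScore);
}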

Vector Store Selection

const config: AgentOSConfig = {
  rag: {
    vectorStore: {
      // Use local store for development (free)
      development: {
        type: 'in-memory',
      },

      // Use efficient hosted store for production
      production: {
        type: 'pinecone',
        index: 'production',
        // Serverless = cheaper for low traffic
        serverless: true,
      },
    },
  },
};

Storage Cost Optimization

AgentOS integrates with @framers/sql-storage-adapter for local-first storage:

import { AgentOS } from '@framers/agentos';
import { createDatabase } from '@framers/sql-storage-adapter';

// Use efficient storage tier
const db = await createDatabase({
  priority: ['indexeddb', 'sqljs'], // Free local storage
  performance: {
    tier: 'efficient',
    batchWrites: true,
    cacheEnabled: true,
  },
});

const config: AgentOSConfig = {
  storage: {
    // Use local storage (free) for:
    conversations: db,
    personas: db,
    preferences: db,

    // Only use cloud for synced data
    sync: {
      enabled: true,
      strategy: 'incremental', // Only sync changes
      interval: 60000, // 1 minute
    },
  },
};

Configuration Reference

Full Configuration Schema

interface AgentOSCostConfig {
  // Performance tier preset
  performanceTier?: 'economy' | 'balanced' | 'performance' | 'custom';

  // Model configuration
  models?: {
    default?: string;
    routing?: string;
    summarization?: string;
    embedding?: string;
  };

  // Token limits
  limits?: {
    maxTokensPerTurn?: number;
    maxTokensPerDay?: number;
    maxCostPerDay?: number; // USD
    maxToolIterations?: number;
  };

  // Caching
  caching?: {
    promptCache?: boolean;
    semanticCache?: boolean;
    toolCache?: boolean;
    embeddingCache?: boolean;
  };

  // Context management
  context?: {
    maxHistoryTokens?: number;
    maxRAGTokens?: number;
    summarizationEnabled?: boolean;
  };

  // RAG settings
  rag?: {
    enabled?: boolean;
    topK?: number;
    minScore?: number;
    hybridSearch?: boolean;
  };

  // Tool execution
  tools?: {
    parallelExecution?: boolean;
    maxConcurrent?: number;
    cacheResults?: boolean;
    timeoutMs?: number;
  };
}

Environment Variables

| Variable | Description | Default |
|---|---|---|
| AGENTOS_PERFORMANCE_TIER | Preset tier | balanced |
| AGENTOS_DEFAULT_MODEL | Default LLM model | gpt-4o-mini |
| AGENTOS_MAX_TOKENS_PER_TURN | Token limit per turn | 1000 |
| AGENTOS_MAX_COST_PER_DAY | Daily cost limit (USD) | 10.00 |
| AGENTOS_CACHING_ENABLED | Enable all caching | true |
| AGENTOS_RAG_ENABLED | Enable RAG | true |
| AGENTOS_RAG_TOP_K | RAG retrieval count | 5 |
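
If you prefer configuring through the environment, the variables above can be mapped onto the config schema from the previous section. A minimal sketch, assuming a Node-style process.env; the parsing is illustrative:

// Minimal sketch: fold the environment variables above into a config object.
// Defaults mirror the table; adjust parsing and validation for production use.
const env = process.env;

const configFromEnv = {
  performanceTier: (env.AGENTOS_PERFORMANCE_TIER ?? 'balanced') as
    'economy' | 'balanced' | 'performance' | 'custom',
  models: { default: env.AGENTOS_DEFAULT_MODEL ?? 'gpt-4o-mini' },
  limits: {
    maxTokensPerTurn: Number(env.AGENTOS_MAX_TOKENS_PER_TURN ?? 1000),
    maxCostPerDay: Number(env.AGENTOS_MAX_COST_PER_DAY ?? 10.0),
  },
  caching: { promptCache: (env.AGENTOS_CACHING_ENABLED ?? 'true') === 'true' },
  rag: {
    enabled: (env.AGENTOS_RAG_ENABLED ?? 'true') === 'true',
    topK: Number(env.AGENTOS_RAG_TOP_K ?? 5),
  },
};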

Monitoring & Budgets

Cost Tracking

const agentos = new AgentOS();

// Subscribe to cost events
agentos.on('cost:turn', (event) => {
  console.log(`Turn cost: $${event.cost.toFixed(4)}`);
  console.log(`  - Input tokens: ${event.inputTokens}`);
  console.log(`  - Output tokens: ${event.outputTokens}`);
  console.log(`  - Model: ${event.model}`);
});

agentos.on('cost:daily', (event) => {
  console.log(`Daily total: $${event.total.toFixed(2)}`);
  if (event.total > event.budget * 0.8) {
    console.warn('Approaching daily budget limit!');
  }
});

// Get usage summary
const usage = await agentos.getUsageSummary({
  period: 'day', // 'hour' | 'day' | 'week' | 'month'
});

console.log(`
Usage Summary:
  Total cost: $${usage.cost.toFixed(2)}
  Total turns: ${usage.turns}
  Total tokens: ${usage.tokens}
  Avg cost/turn: $${(usage.cost / usage.turns).toFixed(4)}
`);

Budget Enforcement

const config: AgentOSConfig = {
  budgets: {
    // Hard limits (will reject requests)
    hardLimits: {
      perTurn: 0.10, // Max $0.10 per turn
      perHour: 5.00, // Max $5 per hour
      perDay: 20.00, // Max $20 per day
    },

    // Soft limits (will warn and downgrade)
    softLimits: {
      perHour: 3.00, // Warn at $3/hour
      perDay: 15.00, // Warn at $15/day
    },

    // Actions when limits reached
    limitActions: {
      softLimit: 'downgrade-model', // Switch to cheaper model
      hardLimit: 'reject', // Reject request
    },
  },
};
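
For scale: using the approximate prices under Cost Factors, a mid-sized gpt-4o turn costs roughly $0.015, so the $5/hour hard limit above is hit after about 330 such turns, while the same budget covers on the order of 9,500 gpt-4o-mini turns at around $0.0005 each.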

Best Practices

1. Start with Economy Tier

// Development and testing
const dev = new AgentOS();
await dev.initialize({ performanceTier: 'economy' });

2. Use Model Tiering in Production

// Route simple queries to cheap models
const prod = new AgentOS();
await prod.initialize({
  modelRouting: {
    simple: { model: 'gpt-4o-mini' },
    complex: { model: 'gpt-4o' },
  },
  autoRouting: { enabled: true },
});

3. Enable Caching

// Cache everything possible
const cached = new AgentOS();
await cached.initialize({
  caching: {
    promptCache: true,
    semanticCache: true,
    toolCache: true,
    embeddingCache: true,
  },
});

4. Set Budget Limits

// Always set limits in production
const safe = new AgentOS();
await safe.initialize({
  budgets: {
    hardLimits: { perDay: 50.00 },
    limitActions: { hardLimit: 'reject' },
  },
});

5. Monitor Usage

// Track costs in real-time
agentos.on('cost:turn', (e) => metrics.record('agentos.cost', e.cost));

Summary

| Strategy | Savings | Implementation |
|---|---|---|
| Use economy tier | 80-90% | performanceTier: 'economy' |
| Model tiering | 50-70% | Route by complexity |
| Caching | 20-40% | Enable all caches |
| Context limits | 20-30% | Set max tokens |
| RAG optimization | 30-50% | Hybrid search, low topK |
| Budget enforcement | Caps worst-case spend | Hard limits |