Cost Optimization

Comprehensive guide for optimizing LLM costs, configuring performance tiers, and implementing sensible defaults across the AgentOS framework.


Table of Contents

  1. Overview
  2. Cost Factors
  3. Optimization Strategies
  4. Performance Tiers
  5. Model Selection
  6. RAG Cost Optimization
  7. Storage Cost Optimization
  8. Configuration Reference
  9. Monitoring & Budgets
  10. Best Practices
  11. Summary

Overview

AgentOS is designed to be cost-conscious by default while allowing fine-grained control for users who need it. This guide covers:

  • LLM Costs: Token usage, model selection, caching
  • RAG Costs: Embedding generation, vector storage, retrieval
  • Storage Costs: Database operations, sync bandwidth
  • Compute Costs: Tool execution, streaming overhead

Key Principles

  1. Sensible Defaults: Out-of-box configuration minimizes cost while maintaining quality
  2. Configurable Tradeoffs: Choose between speed, cost, and accuracy
  3. Transparency: Built-in metrics for cost tracking
  4. Graceful Degradation: Falls back to cheaper options when possible

Cost Factors

LLM Token Costs (Approximate)

| Model | Input (per 1K) | Output (per 1K) | Context Window |
|---|---|---|---|
| GPT-4o | $0.005 | $0.015 | 128K |
| GPT-4o-mini | $0.00015 | $0.0006 | 128K |
| Claude 3.5 Sonnet | $0.003 | $0.015 | 200K |
| Claude 3 Haiku | $0.00025 | $0.00125 | 200K |
| Gemini 1.5 Pro | $0.00125 | $0.005 | 1M |
| Gemini 1.5 Flash | $0.000075 | $0.0003 | 1M |
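
To see how these rates translate into per-turn spend, the hypothetical helper below (not part of AgentOS) computes a turn's cost from its token counts using the prices in the table:

// Hypothetical helper: estimate the dollar cost of one turn from token counts.
// Prices are per 1K tokens, matching the table above (approximate, subject to change).
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.005, output: 0.015 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
};

function estimateTurnCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No price data for ${model}`);
  return (inputTokens / 1000) * p.input + (outputTokens / 1000) * p.output;
}

// Example: a 1,500-token prompt with a 500-token reply
// gpt-4o:      1.5 * $0.005   + 0.5 * $0.015  = $0.015
// gpt-4o-mini: 1.5 * $0.00015 + 0.5 * $0.0006 = $0.000525 (roughly 30x cheaper)
console.log(estimateTurnCost('gpt-4o', 1500, 500));      // 0.015
console.log(estimateTurnCost('gpt-4o-mini', 1500, 500)); // 0.000525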

Embedding Costs

| Model | Cost (per 1K tokens) | Dimensions |
|---|---|---|
| text-embedding-3-small | $0.00002 | 1536 |
| text-embedding-3-large | $0.00013 | 3072 |
| text-embedding-ada-002 | $0.0001 | 1536 |
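
For scale: embedding a 1M-token corpus costs roughly 1,000 × $0.00002 ≈ $0.02 with text-embedding-3-small, versus about $0.13 with text-embedding-3-large and $0.10 with ada-002, which is why the RAG examples later in this guide default to the small model.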

Storage Costs

| Service | Cost | Notes |
|---|---|---|
| Local SQLite | Free | Included in app |
| Supabase (Free) | Free | 500MB, 2 projects |
| Supabase (Pro) | $25/mo | 8GB included |
| Pinecone (Starter) | Free | 100K vectors |
| Pinecone (Standard) | $70/mo | 1M vectors |

Optimization Strategies

1. Model Tiering

Route different tasks to appropriate models:

import { AgentOS, AgentOSConfig } from '@framers/agentos';

const config: AgentOSConfig = {
  modelRouting: {
    // Use cheap models for simple tasks
    simple: {
      model: 'gpt-4o-mini',
      maxTokens: 500,
    },
    // Use powerful models for complex reasoning
    complex: {
      model: 'gpt-4o',
      maxTokens: 2000,
    },
    // Use very cheap for classification/routing
    routing: {
      model: 'gpt-4o-mini',
      maxTokens: 100,
    },
  },

  // Auto-route based on task complexity
  autoRouting: {
    enabled: true,
    complexityThreshold: 0.7, // 0-1 scale
    fallbackModel: 'gpt-4o-mini',
  },
};

const agentos = new AgentOS();
await agentos.initialize(config);

2. Context Window Management

Minimize token usage through smart context management:

const config: AgentOSConfig = {
  contextManagement: {
    // Max tokens for conversation history
    maxHistoryTokens: 4000,

    // Max tokens for RAG context
    maxRAGContextTokens: 2000,

    // Summarization strategy for long conversations
    summarizationStrategy: 'progressive', // 'none' | 'progressive' | 'aggressive'

    // When to summarize (percentage of max tokens)
    summarizeThreshold: 0.8,

    // Use cheaper model for summarization
    summarizationModel: 'gpt-4o-mini',
  },
};

3. Response Caching

Cache common responses to avoid repeated LLM calls:

const config: AgentOSConfig = {
  caching: {
    enabled: true,

    // Cache identical prompts
    promptCache: {
      enabled: true,
      ttlSeconds: 3600, // 1 hour
      maxEntries: 1000,
    },

    // Cache semantic similarity (fuzzy matching)
    semanticCache: {
      enabled: true,
      similarityThreshold: 0.95, // Very high similarity required
      ttlSeconds: 7200, // 2 hours
    },

    // Cache tool results
    toolResultCache: {
      enabled: true,
      ttlSeconds: 300, // 5 minutes
    },
  },
};
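
For intuition about what the semantic cache does, here is a minimal sketch, assuming entries are compared by cosine similarity of prompt embeddings; the types and function names are illustrative, not AgentOS internals:

// Minimal sketch of a semantic cache lookup: reuse a cached response only if a
// previously embedded prompt is similar enough and has not expired.
interface CacheEntry { embedding: number[]; response: string; expiresAt: number; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function lookup(promptEmbedding: number[], entries: CacheEntry[], threshold = 0.95): string | null {
  const now = Date.now();
  let best: { score: number; response: string } | null = null;
  for (const e of entries) {
    if (e.expiresAt < now) continue; // respect ttlSeconds
    const score = cosine(promptEmbedding, e.embedding);
    if (score >= threshold && (!best || score > best.score)) {
      best = { score, response: e.response };
    }
  }
  return best ? best.response : null; // null = cache miss, call the LLM
}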

4. Streaming Optimization

Optimize streaming for cost and latency:

const config: AgentOSConfig = {
  streaming: {
    // Enable streaming (better UX, same cost)
    enabled: true,

    // Batch small chunks (reduce overhead)
    batchingEnabled: true,
    batchIntervalMs: 50,

    // Auto-stop on user interruption (save tokens)
    interruptionHandling: 'stop', // 'stop' | 'complete' | 'summarize'
  },
};

5. Tool Execution Optimization

Minimize tool call overhead:

const config: AgentOSConfig = {
  toolExecution: {
    // Parallel tool execution (faster, same cost)
    parallelExecution: true,
    maxConcurrent: 5,

    // Cache tool results
    cacheResults: true,
    cacheTtlSeconds: 300,

    // Limit tool iterations (prevent runaway costs)
    maxIterations: 10,

    // Timeout for individual tools
    timeoutMs: 30000,
  },
};

Performance Tiers

Tier Configuration

type PerformanceTier = 'economy' | 'balanced' | 'performance' | 'custom';

const TIER_DEFAULTS = {
  economy: {
    defaultModel: 'gpt-4o-mini',
    maxTokensPerTurn: 500,
    cachingEnabled: true,
    summarizationEnabled: true,
    toolParallelization: false,
    ragEnabled: false,
  },
  balanced: {
    defaultModel: 'gpt-4o-mini',
    maxTokensPerTurn: 1000,
    cachingEnabled: true,
    summarizationEnabled: true,
    toolParallelization: true,
    ragEnabled: true,
  },
  performance: {
    defaultModel: 'gpt-4o',
    maxTokensPerTurn: 4000,
    cachingEnabled: false,
    summarizationEnabled: false,
    toolParallelization: true,
    ragEnabled: true,
  },
};

Usage

import { AgentOS } from '@framers/agentos';

// Economy tier: Minimize costs
const economyAgent = new AgentOS();
await economyAgent.initialize({
  performanceTier: 'economy',
});

// Balanced tier: Default, good for most use cases
const balancedAgent = new AgentOS();
await balancedAgent.initialize({
  performanceTier: 'balanced',
});

// Performance tier: Maximum capability
const performanceAgent = new AgentOS();
await performanceAgent.initialize({
  performanceTier: 'performance',
});

// Custom tier: Full control
const customAgent = new AgentOS();
await customAgent.initialize({
  performanceTier: 'custom',
  defaultModel: 'claude-3-haiku',
  maxTokensPerTurn: 2000,
  cachingEnabled: true,
  // ... other options
});

Tier Comparison

| Feature | Economy | Balanced | Performance |
|---|---|---|---|
| Default Model | gpt-4o-mini | gpt-4o-mini | gpt-4o |
| Max Tokens | 500 | 1000 | 4000 |
| Caching | Yes | Yes | No |
| RAG | No | Yes | Yes |
| Tool Parallel | No | Yes | Yes |
| Est. Cost/1K turns | ~$0.10 | ~$0.50 | ~$5.00 |

Model Selection

Automatic Model Selection

AgentOS can automatically select models based on task complexity:

const config: AgentOSConfig = {
  modelSelection: {
    strategy: 'auto', // 'fixed' | 'auto' | 'user-preference'

    // Complexity detection
    complexityDetection: {
      // Use fast classifier to estimate complexity
      classifier: 'rule-based', // 'rule-based' | 'ml' | 'llm'

      // Factors considered
      factors: [
        'messageLength', // Longer = more complex
        'technicalTerms', // Technical vocabulary
        'questionType', // 'factual' vs 'analytical'
        'toolRequirements', // Tools needed
        'contextDependency', // Needs history?
      ],
    },

    // Model mapping by complexity
    complexityMapping: {
      low: 'gpt-4o-mini', // 0.0 - 0.3
      medium: 'gpt-4o-mini', // 0.3 - 0.7
      high: 'gpt-4o', // 0.7 - 1.0
    },
  },
};
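
The 'rule-based' classifier is cheap because it never calls a model. As a rough sketch of how the listed factors could be folded into a 0-1 score and mapped through complexityMapping (the weights below are assumptions, not the values AgentOS uses):

// Illustrative rule-based complexity score in [0, 1]; weights are assumptions.
interface ComplexitySignals {
  messageLength: number;   // characters in the user message
  technicalTerms: number;  // count of technical terms detected
  isAnalytical: boolean;   // 'analytical' vs 'factual' question type
  toolsRequired: number;   // number of tools the request appears to need
  needsHistory: boolean;   // depends on earlier conversation context
}

function complexityScore(s: ComplexitySignals): number {
  let score = 0;
  score += Math.min(s.messageLength / 2000, 1) * 0.25;
  score += Math.min(s.technicalTerms / 10, 1) * 0.25;
  score += (s.isAnalytical ? 1 : 0) * 0.2;
  score += Math.min(s.toolsRequired / 3, 1) * 0.2;
  score += (s.needsHistory ? 1 : 0) * 0.1;
  return score;
}

function pickModel(score: number): string {
  if (score < 0.3) return 'gpt-4o-mini'; // low
  if (score < 0.7) return 'gpt-4o-mini'; // medium
  return 'gpt-4o';                       // high
}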

Provider Fallback

Configure fallback providers for reliability and cost control:

const config: AgentOSConfig = {
  providers: {
    primary: {
      name: 'openai',
      models: ['gpt-4o', 'gpt-4o-mini'],
      apiKey: process.env.OPENAI_API_KEY,
    },
    fallback: [
      {
        name: 'anthropic',
        models: ['claude-3-haiku'],
        apiKey: process.env.ANTHROPIC_API_KEY,
        // Use when primary fails or is expensive
        conditions: {
          onPrimaryFailure: true,
          onPrimaryOverBudget: true,
        },
      },
    ],
  },
};

RAG Cost Optimization

Embedding Strategy

const config: AgentOSConfig = {
  rag: {
    embedding: {
      // Use smaller, cheaper embeddings
      model: 'text-embedding-3-small', // vs 'large'
      dimensions: 1536,

      // Batch embeddings (reduce API calls)
      batchSize: 100,

      // Cache embeddings (don't re-embed same content)
      cacheEnabled: true,
      cacheTtlDays: 30,

      // Content deduplication (skip identical content)
      deduplication: true,
    },
  },
};

Retrieval Optimization

const config: AgentOSConfig = {
  rag: {
    retrieval: {
      // Limit retrieved documents
      topK: 5, // Don't retrieve more than needed

      // Minimum relevance threshold
      minScore: 0.7, // Skip low-relevance results

      // Hybrid search (lexical + semantic)
      hybridSearch: {
        enabled: true,
        // Lexical is free, semantic costs embeddings
        lexicalWeight: 0.3,
        semanticWeight: 0.7,
      },

      // Progressive retrieval (start small, expand if needed)
      progressive: {
        enabled: true,
        initialK: 3,
        maxK: 10,
        expansionThreshold: 0.5, // Expand if avg score < threshold
      },
    },
  },
};
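
Progressive retrieval pays for a second, larger query only when the first pass looks weak. A minimal sketch of that rule, assuming a generic search(query, k) function that returns results sorted by score (names here are illustrative, not the AgentOS implementation):

// Sketch of progressive retrieval: start with a small k, expand only when the
// initial results look weak, then drop low-relevance hits either way.
interface Hit { text: string; score: number; }

async function progressiveRetrieve(
  search: (query: string, k: number) => Promise<Hit[]>,
  query: string,
  opts = { initialK: 3, maxK: 10, expansionThreshold: 0.5, minScore: 0.7 },
): Promise<Hit[]> {
  let hits = await search(query, opts.initialK);
  const avg = hits.reduce((sum, h) => sum + h.score, 0) / Math.max(hits.length, 1);

  // Expand only if the average score of the first pass is below the threshold.
  if (avg < opts.expansionThreshold) {
    hits = await search(query, opts.maxK);
  }

  return hits.filter((h) => h.score >= opts.minScore);
}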

Vector Store Selection

const config: AgentOSConfig = {
  rag: {
    vectorStore: {
      // Use local store for development (free)
      development: {
        type: 'in-memory',
      },

      // Use efficient hosted store for production
      production: {
        type: 'pinecone',
        index: 'production',
        // Serverless = cheaper for low traffic
        serverless: true,
      },
    },
  },
};

Storage Cost Optimization

AgentOS integrates with @framers/sql-storage-adapter for local-first storage:

import { AgentOS } from '@framers/agentos';
import { createDatabase } from '@framers/sql-storage-adapter';

// Use efficient storage tier
const db = await createDatabase({
  priority: ['indexeddb', 'sqljs'], // Free local storage
  performance: {
    tier: 'efficient',
    batchWrites: true,
    cacheEnabled: true,
  },
});

const config: AgentOSConfig = {
  storage: {
    // Use local storage (free) for:
    conversations: db,
    personas: db,
    preferences: db,

    // Only use cloud for synced data
    sync: {
      enabled: true,
      strategy: 'incremental', // Only sync changes
      interval: 60000, // 1 minute
    },
  },
};

Configuration Reference

Full Configuration Schema

interface AgentOSCostConfig {
  // Performance tier preset
  performanceTier?: 'economy' | 'balanced' | 'performance' | 'custom';

  // Model configuration
  models?: {
    default?: string;
    routing?: string;
    summarization?: string;
    embedding?: string;
  };

  // Token limits
  limits?: {
    maxTokensPerTurn?: number;
    maxTokensPerDay?: number;
    maxCostPerDay?: number; // USD
    maxToolIterations?: number;
  };

  // Caching
  caching?: {
    promptCache?: boolean;
    semanticCache?: boolean;
    toolCache?: boolean;
    embeddingCache?: boolean;
  };

  // Context management
  context?: {
    maxHistoryTokens?: number;
    maxRAGTokens?: number;
    summarizationEnabled?: boolean;
  };

  // RAG settings
  rag?: {
    enabled?: boolean;
    topK?: number;
    minScore?: number;
    hybridSearch?: boolean;
  };

  // Tool execution
  tools?: {
    parallelExecution?: boolean;
    maxConcurrent?: number;
    cacheResults?: boolean;
    timeoutMs?: number;
  };
}

Environment Variables

| Variable | Description | Default |
|---|---|---|
| AGENTOS_PERFORMANCE_TIER | Preset tier | balanced |
| AGENTOS_DEFAULT_MODEL | Default LLM model | gpt-4o-mini |
| AGENTOS_MAX_TOKENS_PER_TURN | Token limit per turn | 1000 |
| AGENTOS_MAX_COST_PER_DAY | Daily cost limit (USD) | 10.00 |
| AGENTOS_CACHING_ENABLED | Enable all caching | true |
| AGENTOS_RAG_ENABLED | Enable RAG | true |
| AGENTOS_RAG_TOP_K | RAG retrieval count | 5 |
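
If you prefer configuring through the environment, the variables above can be mapped onto the config schema from the previous section. A minimal sketch, assuming a Node-style process.env; the parsing is illustrative:

// Minimal sketch: fold the environment variables above into a config object.
// Defaults mirror the table; adjust parsing and validation for production use.
const env = process.env;

const configFromEnv = {
  performanceTier: (env.AGENTOS_PERFORMANCE_TIER ?? 'balanced') as
    'economy' | 'balanced' | 'performance' | 'custom',
  models: { default: env.AGENTOS_DEFAULT_MODEL ?? 'gpt-4o-mini' },
  limits: {
    maxTokensPerTurn: Number(env.AGENTOS_MAX_TOKENS_PER_TURN ?? 1000),
    maxCostPerDay: Number(env.AGENTOS_MAX_COST_PER_DAY ?? 10.0),
  },
  caching: { promptCache: (env.AGENTOS_CACHING_ENABLED ?? 'true') === 'true' },
  rag: {
    enabled: (env.AGENTOS_RAG_ENABLED ?? 'true') === 'true',
    topK: Number(env.AGENTOS_RAG_TOP_K ?? 5),
  },
};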

Monitoring & Budgets

Cost Tracking

const agentos = new AgentOS();

// Subscribe to cost events
agentos.on('cost:turn', (event) => {
  console.log(`Turn cost: $${event.cost.toFixed(4)}`);
  console.log(`  - Input tokens: ${event.inputTokens}`);
  console.log(`  - Output tokens: ${event.outputTokens}`);
  console.log(`  - Model: ${event.model}`);
});

agentos.on('cost:daily', (event) => {
  console.log(`Daily total: $${event.total.toFixed(2)}`);
  if (event.total > event.budget * 0.8) {
    console.warn('Approaching daily budget limit!');
  }
});

// Get usage summary
const usage = await agentos.getUsageSummary({
  period: 'day', // 'hour' | 'day' | 'week' | 'month'
});

console.log(`
Usage Summary:
  Total cost: $${usage.cost.toFixed(2)}
  Total turns: ${usage.turns}
  Total tokens: ${usage.tokens}
  Avg cost/turn: $${(usage.cost / usage.turns).toFixed(4)}
`);

Budget Enforcement

const config: AgentOSConfig = {
  budgets: {
    // Hard limits (will reject requests)
    hardLimits: {
      perTurn: 0.10, // Max $0.10 per turn
      perHour: 5.00, // Max $5 per hour
      perDay: 20.00, // Max $20 per day
    },

    // Soft limits (will warn and downgrade)
    softLimits: {
      perHour: 3.00, // Warn at $3/hour
      perDay: 15.00, // Warn at $15/day
    },

    // Actions when limits reached
    limitActions: {
      softLimit: 'downgrade-model', // Switch to cheaper model
      hardLimit: 'reject', // Reject request
    },
  },
};
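
For scale: using the approximate prices under Cost Factors, a mid-sized gpt-4o turn costs roughly $0.015, so the $5/hour hard limit above is hit after about 330 such turns, while the same budget covers on the order of 9,500 gpt-4o-mini turns at around $0.0005 each.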

Best Practices

1. Start with Economy Tier

// Development and testing
const dev = new AgentOS();
await dev.initialize({ performanceTier: 'economy' });

2. Use Model Tiering in Production

// Route simple queries to cheap models
const prod = new AgentOS();
await prod.initialize({
  modelRouting: {
    simple: { model: 'gpt-4o-mini' },
    complex: { model: 'gpt-4o' },
  },
  autoRouting: { enabled: true },
});

3. Enable Caching

// Cache everything possible
const cached = new AgentOS();
await cached.initialize({
  caching: {
    promptCache: true,
    semanticCache: true,
    toolCache: true,
    embeddingCache: true,
  },
});

4. Set Budget Limits

// Always set limits in production
const safe = new AgentOS();
await safe.initialize({
  budgets: {
    hardLimits: { perDay: 50.00 },
    limitActions: { hardLimit: 'reject' },
  },
});

5. Monitor Usage

// Track costs in real-time
agentos.on('cost:turn', (e) => metrics.record('agentos.cost', e.cost));

Summary

| Strategy | Savings | Implementation |
|---|---|---|
| Use economy tier | 80-90% | performanceTier: 'economy' |
| Model tiering | 50-70% | Route by complexity |
| Caching | 20-40% | Enable all caches |
| Context limits | 20-30% | Set max tokens |
| RAG optimization | 30-50% | Hybrid search, low topK |
| Budget enforcement | Caps worst-case spend | Hard limits |