Skip to main content

Vision Pipeline (OCR & Image Understanding)

A 3-tier progressive enhancement pipeline for extracting text, understanding images, and generating visual embeddings.


Table of Contents

  1. Overview
  2. Quick Start
  3. Architecture
  4. Three-Tier Progressive Enhancement
  5. Processing Strategies
  6. Content Detection
  7. CLIP Embeddings
  8. Integration with Multimodal RAG
  9. VisionPipeline API Reference
  10. createVisionPipeline() Options
  11. VisionResult Shape
  12. Installation
  13. Provider Configuration
  14. Examples
  15. Related Documentation

Overview

The Vision Pipeline provides a unified interface for extracting text, detecting content types, and generating embeddings from images. It uses a tiered architecture that automatically selects the best available provider, from lightweight local OCR to high-accuracy cloud vision APIs.

Key capabilities:

CapabilityDescription
OCRExtract printed and handwritten text from images
Document LayoutDetect headers, paragraphs, tables, figures
Image DescriptionGenerate natural-language descriptions
Content ClassificationIdentify content type (printed, handwritten, document, photo)
CLIP EmbeddingsGenerate 512-d vectors for semantic image search
Region DetectionLocate text regions with bounding boxes and confidence

Quick Start

import { createVisionPipeline } from '@framers/agentos';
import { readFileSync } from 'node:fs';

// Create a pipeline with progressive strategy (local-first, cloud fallback)
const vision = await createVisionPipeline({ strategy: 'progressive' });

// Process an image
const image = readFileSync('./document.png');
const result = await vision.process(image);

console.log(result.text); // Extracted text content
console.log(result.confidence); // Overall confidence (0.0–1.0)
console.log(result.contentType); // 'printed' | 'handwritten' | 'document-layout' | 'photograph'
console.log(result.regions); // Array of detected text regions with bounding boxes
console.log(result.tierBreakdown); // Which tiers ran and their timing

CLI

# Extract text from an image
wunderland vision ocr ./document.png

# Describe an image
wunderland vision describe ./photo.jpg

# Generate a CLIP embedding
wunderland vision embed ./image.png

Architecture

The pipeline uses a 3-tier progressive enhancement model. Each tier represents a different level of capability and cost:

┌─────────────────────────────────────────────┐
│ VisionPipeline │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Tier 0 │→ │ Tier 1 │→ │ Tier 2 │ │
│ │ Local │ │ Enhanced │ │ Cloud │ │
│ │ OCR │ │ Local │ │ Vision │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ PaddleOCR TrOCR/ Google Cloud │
│ Tesseract Florence-2 Vision / GPT │
│ CLIP Vision │
└─────────────────────────────────────────────┘

The pipeline dispatches to available tiers based on the chosen strategy, falling back gracefully when a tier is unavailable or fails.


Three-Tier Progressive Enhancement

Tier 0 — Local OCR (Lightweight)

The fastest tier, optimized for printed text. Zero cloud dependency.

ProviderInstallStrengths
PaddleOCRnpm install ppu-paddle-ocrSOTA accuracy for printed text, fast inference
Tesseract.jsnpm install tesseract.js100+ languages, widely supported, pure JS

Tier 0 runs first in the progressive strategy. If confidence is high enough (default threshold: 0.85), higher tiers are skipped.

const vision = await createVisionPipeline({
strategy: 'local-only',
tier0: {
provider: 'paddle-ocr', // 'paddle-ocr' | 'tesseract'
confidenceThreshold: 0.85,
languages: ['en'], // Tesseract language codes
},
});

Tier 1 — Enhanced Local (Transformers)

Uses @huggingface/transformers (already included in AgentOS) for capabilities beyond basic OCR:

ModelPurpose
TrOCRHandwriting recognition — outperforms Tesseract on cursive/informal text
Florence-2Document layout understanding — headers, tables, figures, reading order
CLIPImage embeddings for semantic search — 512-dimensional vectors

Tier 1 activates when Tier 0 confidence is below threshold or when the content type requires it (e.g., handwritten text, document layout analysis).

const vision = await createVisionPipeline({
strategy: 'progressive',
tier1: {
enableTrOCR: true, // Handwriting recognition
enableFlorence2: true, // Document layout
enableCLIP: true, // Image embeddings
modelCacheDir: '~/.cache/huggingface',
},
});

Tier 2 — Cloud Vision (Highest Accuracy)

Cloud APIs provide the highest accuracy and broadest capability set.

ProviderEnv VarStrengths
Google Cloud VisionGOOGLE_CLOUD_VISION_KEYDocument AI, handwriting, 100+ languages
OpenAI GPT VisionOPENAI_API_KEYImage understanding, description, reasoning
Anthropic VisionANTHROPIC_API_KEYDetailed analysis, document comprehension

Tier 2 is used as the final fallback in progressive strategy, or exclusively in cloud-only strategy.

const vision = await createVisionPipeline({
strategy: 'progressive',
tier2: {
provider: 'google-cloud-vision', // 'google-cloud-vision' | 'openai' | 'anthropic'
maxCostPerRequest: 0.02, // Budget cap per image (USD)
},
});

Processing Strategies

Four strategies control how tiers are orchestrated:

progressive (Default)

Starts at Tier 0, escalates to higher tiers only when confidence is below threshold. Optimal cost/quality tradeoff.

const vision = await createVisionPipeline({ strategy: 'progressive' });
// Tier 0 → if low confidence → Tier 1 → if still low → Tier 2

local-only

Never calls cloud APIs. Uses only Tier 0 and Tier 1. Suitable for air-gapped environments, privacy-sensitive content, or cost elimination.

const vision = await createVisionPipeline({ strategy: 'local-only' });
// Tier 0 → Tier 1 (never Tier 2)

cloud-only

Skips local processing entirely and sends directly to cloud APIs. Best when accuracy is paramount and latency/cost are acceptable.

const vision = await createVisionPipeline({ strategy: 'cloud-only' });
// Tier 2 only

parallel

Runs all available tiers simultaneously and merges results. Highest accuracy but also highest resource usage. Useful for critical document processing.

const vision = await createVisionPipeline({ strategy: 'parallel' });
// Tier 0 + Tier 1 + Tier 2 in parallel → merge best results

Content Detection

The pipeline automatically classifies the type of visual content to route processing appropriately:

Content TypeDescriptionBest Tier
printedMachine-printed text (documents, signs, screenshots)Tier 0 (PaddleOCR)
handwrittenHandwritten or cursive textTier 1 (TrOCR)
document-layoutStructured documents with headers, tables, figuresTier 1 (Florence-2)
photographNatural photographs (people, scenes, objects)Tier 2 (GPT/Cloud Vision)

Content detection runs as a lightweight pre-processing step using CLIP zero-shot classification. You can also force a content type:

const result = await vision.process(image, {
contentType: 'handwritten', // Skip auto-detection, go directly to TrOCR
});

CLIP Embeddings

Generate 512-dimensional embedding vectors for semantic image search and retrieval. CLIP embeddings enable cross-modal search (find images by text description and vice versa).

import { createVisionPipeline } from '@framers/agentos';

const vision = await createVisionPipeline({
strategy: 'local-only',
tier1: { enableCLIP: true },
});

// Generate an embedding vector
const result = await vision.embed(imageBuffer);
console.log(result.embedding); // Float32Array(512)
console.log(result.model); // 'clip-vit-base-patch32'

// Use with vector stores for image search
import { createVectorStore } from '@framers/agentos';
const store = await createVectorStore({ type: 'hnsw' });
await store.upsert({ id: 'img-1', vector: result.embedding, metadata: { path: './photo.jpg' } });

// Search by text
const textEmbedding = await vision.embedText('a sunset over the ocean');
const matches = await store.query(textEmbedding, { topK: 5 });

Integration with Multimodal RAG

The vision pipeline integrates directly with the Multimodal RAG system for indexing and retrieving image content:

import { createAgent } from '@framers/agentos';

const agent = await createAgent({
ragMemory: {
enabled: true,
multimodal: true,
vision: {
strategy: 'progressive',
indexImages: true, // Auto-index images encountered in conversations
generateDescriptions: true, // Store natural-language descriptions alongside embeddings
},
},
});

// The agent can now answer questions about images in its memory
const response = await agent.chat('What did the receipt from yesterday say?');

VisionPipeline API Reference

Methods

MethodSignatureDescription
process()(image: Buffer, options?) => Promise<VisionResult>Full processing: OCR + content detection + layout
ocr()(image: Buffer, options?) => Promise<OcrResult>Text extraction only
describe()(image: Buffer, options?) => Promise<DescribeResult>Natural-language image description
embed()(image: Buffer) => Promise<EmbedResult>CLIP embedding vector
embedText()(text: string) => Promise<Float32Array>Text embedding (same CLIP space)
detectContent()(image: Buffer) => Promise<ContentType>Content type classification
dispose()() => Promise<void>Release model resources

createVisionPipeline() Options

interface VisionPipelineOptions {
/** Processing strategy. Default: 'progressive'. */
strategy: 'progressive' | 'local-only' | 'cloud-only' | 'parallel';

/** Tier 0 configuration (local OCR). */
tier0?: {
provider?: 'paddle-ocr' | 'tesseract';
confidenceThreshold?: number; // Default: 0.85
languages?: string[]; // Default: ['en']
};

/** Tier 1 configuration (enhanced local). */
tier1?: {
enableTrOCR?: boolean; // Default: true
enableFlorence2?: boolean; // Default: true
enableCLIP?: boolean; // Default: true
modelCacheDir?: string; // Default: ~/.cache/huggingface
};

/** Tier 2 configuration (cloud vision). */
tier2?: {
provider?: 'google-cloud-vision' | 'openai' | 'anthropic';
maxCostPerRequest?: number; // Budget cap in USD
};

/** Global options. */
maxImageSize?: number; // Max dimension in pixels (auto-resize). Default: 4096
timeout?: number; // Per-tier timeout in ms. Default: 30000
enableRegionDetection?: boolean; // Bounding boxes for text regions. Default: true
}

VisionResult Shape

interface VisionResult {
/** Extracted text content (all tiers merged). */
text: string;

/** Overall confidence score (0.0–1.0). */
confidence: number;

/** Detected content type. */
contentType: 'printed' | 'handwritten' | 'document-layout' | 'photograph';

/** Detected text regions with bounding boxes and per-region confidence. */
regions: VisionRegion[];

/** Document layout elements (headers, paragraphs, tables, figures). */
layout?: LayoutElement[];

/** Natural-language description of the image (if Tier 2 ran). */
description?: string;

/** CLIP embedding vector (if enabled). */
embedding?: Float32Array;

/** Breakdown of which tiers ran and their timing. */
tierBreakdown: TierReport[];
}

interface VisionRegion {
text: string;
confidence: number;
bbox: { x: number; y: number; width: number; height: number };
tier: 0 | 1 | 2;
}

interface TierReport {
tier: 0 | 1 | 2;
provider: string;
durationMs: number;
confidence: number;
skipped: boolean;
skipReason?: string;
}

Installation

The vision pipeline works out of the box with @huggingface/transformers (bundled with AgentOS). Install optional providers for better results:

# Tier 0: Local OCR providers (install one or both)
npm install ppu-paddle-ocr # SOTA OCR for printed text (~50 MB)
npm install tesseract.js # Fallback OCR, 100+ languages (~15 MB)

# Tier 1: Already included via @huggingface/transformers
# TrOCR, Florence-2, and CLIP models are downloaded on first use (~200 MB each)

# Tier 2: Cloud providers — just set environment variables
export GOOGLE_CLOUD_VISION_KEY=your-key # Google Cloud Vision
export OPENAI_API_KEY=sk-... # OpenAI GPT Vision
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic Vision

Provider Configuration

PaddleOCR (Tier 0)

npm install ppu-paddle-ocr

PaddleOCR is a PaddlePaddle-based OCR engine with state-of-the-art accuracy on printed text. It supports English, Chinese, Japanese, Korean, and many other languages.

Tesseract.js (Tier 0)

npm install tesseract.js

Tesseract.js runs entirely in JavaScript/WASM. It supports 100+ languages and is a reliable fallback when PaddleOCR is not installed.

TrOCR (Tier 1)

Loaded automatically via @huggingface/transformers. Excels at handwriting recognition. The model (microsoft/trocr-base-handwritten) is downloaded on first use.

Florence-2 (Tier 1)

Loaded automatically via @huggingface/transformers. Provides document layout understanding including headers, tables, and reading order detection. The model (microsoft/Florence-2-base) is downloaded on first use.

CLIP (Tier 1)

Loaded automatically via @huggingface/transformers. Generates 512-d embedding vectors for semantic image search. The model (openai/clip-vit-base-patch32) is downloaded on first use.


Examples

Batch OCR Processing

import { createVisionPipeline } from '@framers/agentos';
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';

const vision = await createVisionPipeline({ strategy: 'progressive' });

const imageDir = './scanned-documents';
const files = (await readdir(imageDir)).filter(f => /\.(png|jpg|jpeg|tiff)$/i.test(f));

for (const file of files) {
const image = await readFile(join(imageDir, file));
const result = await vision.process(image);
console.log(`${file}: ${result.contentType} (${(result.confidence * 100).toFixed(1)}%)`);
console.log(` Text: ${result.text.slice(0, 200)}...`);
console.log(` Tiers: ${result.tierBreakdown.map(t => `T${t.tier}:${t.durationMs}ms`).join(', ')}`);
}

await vision.dispose();

Image Search Index

import { createVisionPipeline, createVectorStore } from '@framers/agentos';
import { readFile } from 'node:fs/promises';

const vision = await createVisionPipeline({
strategy: 'local-only',
tier1: { enableCLIP: true },
});
const store = await createVectorStore({ type: 'hnsw', dimensions: 512 });

// Index images
const images = ['photo1.jpg', 'photo2.jpg', 'photo3.jpg'];
for (const path of images) {
const buffer = await readFile(path);
const { embedding } = await vision.embed(buffer);
await store.upsert({ id: path, vector: embedding, metadata: { path } });
}

// Search by text description
const query = await vision.embedText('a dog playing in the park');
const results = await store.query(query, { topK: 3 });
console.log('Best matches:', results.map(r => r.metadata.path));