Class: VisionPipeline

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:127

Unified vision pipeline with progressive enhancement.

Processes images through up to three tiers of increasing capability:

  1. Local OCR (PaddleOCR / Tesseract.js) — fast, free, offline
  2. Local Vision Models (TrOCR / Florence-2 / CLIP) — offline but slower
  3. Cloud Vision LLMs (GPT-4o, Claude, Gemini) — best quality, API cost

All heavy dependencies are loaded lazily on first use. The pipeline never imports ML libraries at module load time, so it's safe to instantiate even when optional peer deps are missing — errors only surface when a tier that needs them actually runs.
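
The lazy-loading behavior can be pictured with a small self-contained sketch (illustrative only — the class and loader below are stand-ins, not the pipeline's actual internals):

```typescript
// Sketch: each heavy dependency sits behind a cached async loader, so
// construction touches nothing and failures surface only on first use.
class LazyDep<T> {
  private cached?: Promise<T>;
  constructor(private load: () => Promise<T>) {}

  // The loader runs at most once, on the first get() call.
  get(): Promise<T> {
    this.cached ??= this.load();
    return this.cached;
  }
}

// Hypothetical stand-in for an OCR engine import.
let loadCount = 0;
const ocr = new LazyDep(async () => {
  loadCount++;
  return { recognize: (img: string) => `text from ${img}` };
});

console.log(loadCount); // 0 — constructing the wrapper imports nothing
const engine = await ocr.get();
await ocr.get(); // cached; the loader does not run again
console.log(loadCount); // 1
```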

See

createVisionPipeline for automatic provider detection.

Constructors

Constructor

new VisionPipeline(config): VisionPipeline

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:177

Create a new vision pipeline.

Parameters

config

VisionPipelineConfig

Pipeline configuration. All heavy dependencies are loaded lazily, so construction is synchronous and never imports ML libraries.

Returns

VisionPipeline

Example

const pipeline = new VisionPipeline({
  strategy: 'progressive',
  ocr: 'paddle',
  handwriting: true,
  cloudProvider: 'openai',
});

Methods

analyzeLayout()

analyzeLayout(image): Promise<DocumentLayout>

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:479

Analyze document layout using Florence-2 — document-ai tier only.

Returns structured DocumentLayout with semantic blocks (text, tables, figures, headings, lists, code) and their bounding boxes within each page.

Parameters

image

Image data as a Buffer or file-path / URL string.

string | Buffer

Returns

Promise<DocumentLayout>

Structured document layout with pages and blocks.

Throws

If @huggingface/transformers is not installed.

Throws

If Florence-2 model loading fails.

Example

const layout = await pipeline.analyzeLayout(documentScan);
for (const page of layout.pages) {
  for (const block of page.blocks) {
    console.log(`${block.type}: ${block.content.slice(0, 50)}...`);
  }
}
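
Assuming the page/block shape shown in the example, collecting every block of one type across pages is a small traversal. The `Block`/`Page`/`Layout` interfaces below are illustrative stand-ins, not the library's actual `DocumentLayout` types:

```typescript
// Sketch: gather all blocks of a given type across a document's pages.
interface Block { type: string; content: string; }
interface Page { blocks: Block[]; }
interface Layout { pages: Page[]; }

function blocksOfType(layout: Layout, type: string): Block[] {
  return layout.pages.flatMap((p) => p.blocks.filter((b) => b.type === type));
}

// Stubbed two-page layout for illustration.
const stubLayout: Layout = {
  pages: [
    { blocks: [{ type: 'heading', content: 'Invoice' }, { type: 'table', content: '| qty | price |' }] },
    { blocks: [{ type: 'table', content: '| total | 42.99 |' }] },
  ],
};

console.log(blocksOfType(stubLayout, 'table').length); // 2
```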

dispose()

dispose(): Promise<void>

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:506

Shut down the pipeline and release all loaded model resources.

After calling dispose(), any further calls to process(), extractText(), embed(), or analyzeLayout() will throw.

Returns

Promise<void>

Example

const pipeline = new VisionPipeline({ strategy: 'progressive' });
try {
  const result = await pipeline.process(image);
} finally {
  await pipeline.dispose();
}

embed()

embed(image): Promise<number[]>

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:442

Generate an image embedding using CLIP — embedding tier only.

Useful for building image similarity search indexes without running the full OCR + vision pipeline.

Parameters

image

Image data as a Buffer or file-path / URL string.

string | Buffer

Returns

Promise<number[]>

CLIP embedding vector (typically 512 or 768 dimensions).

Throws

If @huggingface/transformers is not installed.

Throws

If CLIP model loading fails.

Example

const embedding = await pipeline.embed(photoBuffer);
await vectorStore.upsert('images', [{
  id: 'photo-1',
  embedding,
  metadata: { source: 'upload' },
}]);
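
Once embeddings are stored, similarity search reduces to vector comparison. A self-contained cosine-similarity sketch, independent of any particular vector store (the index shape here is a made-up illustration):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hypothetical index of previously embedded images (tiny vectors for clarity;
// real CLIP embeddings would be 512- or 768-dimensional).
const index = [
  { id: 'photo-1', embedding: [1, 0, 0] },
  { id: 'photo-2', embedding: [0, 1, 0] },
];

const query = [0.9, 0.1, 0];
const best = index
  .map((e) => ({ id: e.id, score: cosine(query, e.embedding) }))
  .sort((x, y) => y.score - x.score)[0];

console.log(best.id); // 'photo-1'
```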

extractText()

extractText(image): Promise<string>

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:409

Extract text only — fastest path using OCR tier exclusively.

Ignores all other tiers (handwriting, document-ai, cloud, embedding). Useful when you just need the text content and don't need confidence scoring, layout analysis, or embeddings.

Parameters

image

Image data as a Buffer or file-path / URL string.

string | Buffer

Returns

Promise<string>

Extracted text, or empty string if OCR produces no output.

Throws

If the configured OCR engine is missing.

Example

const text = await pipeline.extractText(receiptImage);
console.log(text); // "ACME STORE\n...\nTotal: $42.99"

process()

process(image, options?): Promise<VisionResult>

Defined in: packages/agentos/src/core/vision/VisionPipeline.ts:222

Process an image through the configured tiers.

Automatically detects content type (printed text, handwritten, diagram, etc.) and routes through the appropriate processing tiers based on the configured VisionStrategy.

Parameters

image

Image data as a Buffer or file-path / URL string. Buffers are preprocessed with sharp (if configured). URL strings are passed directly to providers that support them.

string | Buffer

options?

Optional overrides for this specific invocation.

forceCategory?

VisionContentCategory

Force a specific content category instead of auto-detecting from OCR confidence heuristics.

tiers?

VisionTier[]

Run only these specific tiers, ignoring the strategy's normal routing logic.

Returns

Promise<VisionResult>

Aggregated vision result with text, confidence, embeddings, etc.

Throws

If all configured tiers fail to produce a result.

Throws

If a required dependency (e.g. ppu-paddle-ocr) is missing.

Throws

If dispose() was already called.

Example

// Full progressive pipeline
const result = await pipeline.process(imageBuffer);

// Force handwriting mode
const hw = await pipeline.process(scanBuffer, {
  forceCategory: 'handwritten',
});

// Only run OCR and embedding, skip everything else
const partial = await pipeline.process(imageBuffer, {
  tiers: ['ocr', 'embedding'],
});
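
For intuition, the strategy routing described above can be pictured as a function from detected content category to a tier list, with `options.tiers` bypassing it entirely. The mapping below is a made-up illustration of the idea, not the library's actual routing rules:

```typescript
type Tier = 'ocr' | 'handwriting' | 'document-ai' | 'embedding' | 'cloud';
type Category = 'printed' | 'handwritten' | 'diagram';

// Illustrative routing: handwritten content escalates to the handwriting
// tier, diagrams favor document-ai, and printed text stays on cheap OCR.
function routeTiers(category: Category, override?: Tier[]): Tier[] {
  if (override) return override; // options.tiers skips routing entirely
  switch (category) {
    case 'handwritten': return ['ocr', 'handwriting', 'cloud'];
    case 'diagram':     return ['document-ai', 'cloud'];
    default:            return ['ocr'];
  }
}

console.log(routeTiers('printed'));                      // ['ocr']
console.log(routeTiers('handwritten'));                  // ['ocr', 'handwriting', 'cloud']
console.log(routeTiers('printed', ['ocr', 'embedding'])); // ['ocr', 'embedding']
```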