Class: MultimodalIndexer

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:125

Indexes non-text content (images, audio) into the vector store by generating text descriptions and embeddings.

Image indexing flow

  1. If the image is a Buffer, convert to base64 data URL.
  2. Send to the vision LLM to generate a text description.
  3. Embed the description via the embedding manager.
  4. Store in the vector store with modality: 'image' metadata.
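Step 1 of the flow can be sketched in isolation; `toDataUrl` is a hypothetical helper shown for illustration, not part of the class API:

```typescript
// Wrap raw image bytes in a base64 data URL that a vision LLM accepts.
// Hypothetical helper; the indexer performs this conversion internally.
function toDataUrl(image: Buffer, mimeType = 'image/jpeg'): string {
  return `data:${mimeType};base64,${image.toString('base64')}`;
}

// First bytes of a JPEG header, for demonstration.
const url = toDataUrl(Buffer.from([0xff, 0xd8, 0xff]));
// url === 'data:image/jpeg;base64,/9j/'
```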

Audio indexing flow

  1. Send the audio buffer to the STT provider for transcription.
  2. Embed the transcript via the embedding manager.
  3. Store in the vector store with modality: 'audio' metadata.

Search flow

  1. Embed the text query via the embedding manager.
  2. Query the vector store with optional modality filters.
  3. Return results annotated with their source modality.
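The search flow can be sketched end to end with toy stand-ins; the character-histogram `embed` and in-memory `store` are assumptions standing in for the real embedding manager and vector store:

```typescript
type Doc = { id: string; modality: 'image' | 'audio' | 'text'; vector: number[] };

// Toy 4-dimensional character-histogram "embedding" (illustration only).
function embed(text: string): number[] {
  const v = [0, 0, 0, 0];
  for (const ch of text) v[ch.charCodeAt(0) % 4] += 1;
  const norm = Math.hypot(...v) || 1;
  return v.map((x) => x / norm);
}

// Embed the query, filter by modality, score by dot product, sort descending.
function searchDocs(query: string, docs: Doc[], modalities?: string[]) {
  const q = embed(query);
  return docs
    .filter((d) => !modalities || modalities.includes(d.modality))
    .map((d) => ({ ...d, score: d.vector.reduce((s, x, i) => s + x * q[i], 0) }))
    .sort((a, b) => b.score - a.score);
}

const store: Doc[] = [
  { id: 'img-1', modality: 'image', vector: embed('a cat on a beach') },
  { id: 'aud-1', modality: 'audio', vector: embed('weekly standup recording') },
];

const hits = searchDocs('cats on a beach', store, ['image']);
// Only the image document survives the modality filter.
```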

Example

import fs from 'node:fs';
import { MultimodalIndexer } from '@framers/agentos/rag/multimodal';

const indexer = new MultimodalIndexer({
  embeddingManager,
  vectorStore,
  visionProvider,
  sttProvider,
});

// Index an image
const imgResult = await indexer.indexImage({
  image: fs.readFileSync('./photo.jpg'),
  metadata: { source: 'upload' },
});

// Index audio
const audioResult = await indexer.indexAudio({
  audio: fs.readFileSync('./meeting.wav'),
  language: 'en',
});

// Search across all modalities
const results = await indexer.search('cats on a beach');

Constructors

Constructor

new MultimodalIndexer(deps): MultimodalIndexer

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:201

Create a new multimodal indexer.

Parameters

deps

Dependency injection container.

config?

MultimodalIndexerConfig

Optional configuration overrides.

embeddingManager

IEmbeddingManager

Manager for generating text embeddings.

sttProvider?

ISpeechToTextProvider

Optional STT provider for audio transcription.

vectorStore

IVectorStore

Vector store for document storage and search.

visionPipeline?

VisionPipeline

Optional full vision pipeline with OCR, handwriting, document understanding, CLIP embeddings, and cloud fallback. When provided, it is wrapped as an IVisionProvider via PipelineVisionProvider, overriding any visionProvider passed alongside it.

visionProvider?

IVisionProvider

Optional vision LLM for image description.

Returns

MultimodalIndexer

Throws

If embeddingManager or vectorStore is missing.
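Based on the Throws clause, the dependency guard presumably behaves like this sketch (assumed behavior, not the actual source):

```typescript
// Hypothetical guard mirroring the documented constructor Throws behavior.
function assertRequiredDeps(deps: { embeddingManager?: unknown; vectorStore?: unknown }): void {
  if (!deps.embeddingManager) {
    throw new Error('MultimodalIndexer requires an embeddingManager');
  }
  if (!deps.vectorStore) {
    throw new Error('MultimodalIndexer requires a vectorStore');
  }
}

assertRequiredDeps({ embeddingManager: {}, vectorStore: {} }); // passes
// assertRequiredDeps({ embeddingManager: {} }) would throw.
```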

Example

// With a simple vision LLM provider
const indexer = new MultimodalIndexer({
embeddingManager,
vectorStore,
visionProvider: myVisionLLM,
sttProvider: myWhisperService,
config: { defaultCollection: 'knowledge' },
});

// With the full vision pipeline (recommended)
const indexer = new MultimodalIndexer({
embeddingManager,
vectorStore,
visionPipeline: myVisionPipeline,
});

Methods

createMemoryBridge()

createMemoryBridge(memoryManager?, options?): MultimodalMemoryBridge

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:595

Create a MultimodalMemoryBridge using this indexer's providers.

The bridge extends this indexer's RAG capabilities with cognitive memory integration, enabling multimodal content to be stored in both the vector store (for search) and long-term memory (for recall during conversation).

Parameters

memoryManager?

ICognitiveMemoryManager

Optional cognitive memory manager for memory trace creation. When omitted, the bridge still indexes into RAG but creates no memory traces.

options?

MultimodalBridgeOptions

Bridge configuration overrides (mood, chunk sizes, etc.).

Returns

MultimodalMemoryBridge

A configured multimodal memory bridge instance.

Example

const bridge = indexer.createMemoryBridge(memoryManager, {
  enableMemory: true,
  defaultChunkSize: 800,
});

await bridge.ingestImage(imageBuffer, { source: 'user-upload' });

See MultimodalMemoryBridge for full documentation.


indexAudio()

indexAudio(opts): Promise<AudioIndexResult>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:391

Index an audio file by transcribing via STT, then embedding and storing the transcript.

Parameters

opts

AudioIndexOptions

Audio data, metadata, collection, and language options.

Returns

Promise<AudioIndexResult>

The document ID and generated transcript.

Throws

If no STT provider is configured.

Throws

If the STT provider fails to transcribe.

Throws

If embedding generation or vector store upsert fails.

Example

const result = await indexer.indexAudio({
  audio: fs.readFileSync('./podcast.mp3'),
  metadata: { source: 'podcast', episode: 42 },
  language: 'en',
});
console.log(result.transcript); // "Welcome to episode 42..."

indexImage()

indexImage(opts): Promise<ImageIndexResult>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:298

Index an image by generating a text description via vision LLM, then embedding and storing the description.

Parameters

opts

ImageIndexOptions

Image data, metadata, and collection options.

Returns

Promise<ImageIndexResult>

The document ID and generated description.

Throws

If no vision provider is configured.

Throws

If the vision LLM fails to describe the image.

Throws

If embedding generation or vector store upsert fails.

Example

const result = await indexer.indexImage({
  image: 'https://example.com/photo.jpg',
  metadata: { source: 'web-scrape', url: 'https://example.com' },
});
console.log(result.description); // "A golden retriever playing fetch..."

search()

search(query, opts?): Promise<MultimodalSearchResult[]>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:479

Search across all modalities (text + image descriptions + audio transcripts).

The query text is embedded, then the vector store is searched with optional modality filtering. Results are returned with their source modality indicated.

Parameters

query

string

Natural language search query.

opts?

MultimodalSearchOptions

Optional search parameters (topK, modalities, collection).

Returns

Promise<MultimodalSearchResult[]>

Array of search results sorted by relevance score (descending).

Throws

If embedding generation fails.

Example

// Search only image descriptions
const imageResults = await indexer.search('cats playing', {
  modalities: ['image'],
  topK: 10,
});

// Search across all modalities
const allResults = await indexer.search('machine learning tutorial');

setHydeRetriever()

setHydeRetriever(retriever): void

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:270

Attach a HyDE retriever to enable hypothesis-driven multimodal search.

Once set, pass hyde: { enabled: true } in the search() options to activate HyDE for that query. The retriever generates a hypothetical answer using an LLM, then embeds that answer instead of the raw query text, which typically yields better recall for exploratory queries.
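The HyDE idea can be sketched with stub providers; `hydeEmbed`, `stubLlm`, and `stubEmbed` are illustrative assumptions, not the HydeRetriever internals:

```typescript
// HyDE in miniature: draft a hypothetical answer with an LLM,
// then embed the draft instead of the raw query.
async function hydeEmbed(
  query: string,
  llm: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
): Promise<number[]> {
  const hypothesis = await llm(`Write a short passage that answers: ${query}`);
  return embed(hypothesis);
}

// Stub providers for demonstration.
const stubLlm = async (prompt: string) => `A hypothetical passage about ${prompt}.`;
const stubEmbed = async (text: string) => [text.length, 1]; // toy 2-dim embedding

const vector = await hydeEmbed('cats on a beach', stubLlm, stubEmbed);
// vector embeds the hypothetical passage, not the raw query text.
```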

Parameters

retriever

HydeRetriever

A pre-configured HydeRetriever instance.

Returns

void

Example

indexer.setHydeRetriever(new HydeRetriever({
  llmCaller: myLlmCaller,
  embeddingManager: myEmbeddingManager,
  config: { enabled: true },
}));

const results = await indexer.search('cats on a beach', {
  hyde: { enabled: true },
});