Class: MultimodalIndexer

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:125

Indexes non-text content (images, audio) into the vector store by generating text descriptions and embeddings.

Image indexing flow

  1. If the image is a Buffer, convert to base64 data URL.
  2. Send to the vision LLM to generate a text description.
  3. Embed the description via the embedding manager.
  4. Store in the vector store with modality: 'image' metadata.
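Step 1 of the flow can be sketched in isolation; `toDataUrl` is a hypothetical helper shown for illustration, not part of the class API:

```typescript
// Wrap raw image bytes in a base64 data URL that a vision LLM accepts.
// Hypothetical helper; the indexer performs this conversion internally.
function toDataUrl(image: Buffer, mimeType = 'image/jpeg'): string {
  return `data:${mimeType};base64,${image.toString('base64')}`;
}

// First bytes of a JPEG header, for demonstration.
const url = toDataUrl(Buffer.from([0xff, 0xd8, 0xff]));
// url === 'data:image/jpeg;base64,/9j/'
```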

Audio indexing flow

  1. Send the audio buffer to the STT provider for transcription.
  2. Embed the transcript via the embedding manager.
  3. Store in the vector store with modality: 'audio' metadata.

Search flow

  1. Embed the text query via the embedding manager.
  2. Query the vector store with optional modality filters.
  3. Return results annotated with their source modality.
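The search flow can be sketched end to end with toy stand-ins; the character-histogram `embed` and in-memory `store` are assumptions standing in for the real embedding manager and vector store:

```typescript
type Doc = { id: string; modality: 'image' | 'audio' | 'text'; vector: number[] };

// Toy 4-dimensional character-histogram "embedding" (illustration only).
function embed(text: string): number[] {
  const v = [0, 0, 0, 0];
  for (const ch of text) v[ch.charCodeAt(0) % 4] += 1;
  const norm = Math.hypot(...v) || 1;
  return v.map((x) => x / norm);
}

// Embed the query, filter by modality, score by dot product, sort descending.
function searchDocs(query: string, docs: Doc[], modalities?: string[]) {
  const q = embed(query);
  return docs
    .filter((d) => !modalities || modalities.includes(d.modality))
    .map((d) => ({ ...d, score: d.vector.reduce((s, x, i) => s + x * q[i], 0) }))
    .sort((a, b) => b.score - a.score);
}

const store: Doc[] = [
  { id: 'img-1', modality: 'image', vector: embed('a cat on a beach') },
  { id: 'aud-1', modality: 'audio', vector: embed('weekly standup recording') },
];

const hits = searchDocs('cats on a beach', store, ['image']);
// Only the image document survives the modality filter.
```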

Example

import fs from 'node:fs';
import { MultimodalIndexer } from '@framers/agentos/rag/multimodal';

const indexer = new MultimodalIndexer({
  embeddingManager,
  vectorStore,
  visionProvider,
  sttProvider,
});

// Index an image
const imgResult = await indexer.indexImage({
  image: fs.readFileSync('./photo.jpg'),
  metadata: { source: 'upload' },
});

// Index audio
const audioResult = await indexer.indexAudio({
  audio: fs.readFileSync('./meeting.wav'),
  language: 'en',
});

// Search across all modalities
const results = await indexer.search('cats on a beach');

Constructors

Constructor

new MultimodalIndexer(deps): MultimodalIndexer

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:201

Create a new multimodal indexer.

Parameters

deps

Dependency injection container.

config?

MultimodalIndexerConfig

Optional configuration overrides.

embeddingManager

IEmbeddingManager

Manager for generating text embeddings.

sttProvider?

ISpeechToTextProvider

Optional STT provider for audio transcription.

vectorStore

IVectorStore

Vector store for document storage and search.

visionPipeline?

VisionPipeline

Optional full vision pipeline with OCR, handwriting, document understanding, CLIP embeddings, and cloud fallback. When provided, it is wrapped as an IVisionProvider via PipelineVisionProvider, overriding any visionProvider passed alongside it.

visionProvider?

IVisionProvider

Optional vision LLM for image description.

Returns

MultimodalIndexer

Throws

If embeddingManager or vectorStore is missing.
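Based on the Throws clause, the dependency guard presumably behaves like this sketch (assumed behavior, not the actual source):

```typescript
// Hypothetical guard mirroring the documented constructor Throws behavior.
function assertRequiredDeps(deps: { embeddingManager?: unknown; vectorStore?: unknown }): void {
  if (!deps.embeddingManager) {
    throw new Error('MultimodalIndexer requires an embeddingManager');
  }
  if (!deps.vectorStore) {
    throw new Error('MultimodalIndexer requires a vectorStore');
  }
}

assertRequiredDeps({ embeddingManager: {}, vectorStore: {} }); // passes
// assertRequiredDeps({ embeddingManager: {} }) would throw.
```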

Example

// With a simple vision LLM provider
const indexer = new MultimodalIndexer({
embeddingManager,
vectorStore,
visionProvider: myVisionLLM,
sttProvider: myWhisperService,
config: { defaultCollection: 'knowledge' },
});

// With the full vision pipeline (recommended)
const indexer = new MultimodalIndexer({
embeddingManager,
vectorStore,
visionPipeline: myVisionPipeline,
});

Methods

createMemoryBridge()

createMemoryBridge(memoryManager?, options?): MultimodalMemoryBridge

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:595

Create a MultimodalMemoryBridge using this indexer's providers.

The bridge extends this indexer's RAG capabilities with cognitive memory integration, enabling multimodal content to be stored in both the vector store (for search) and long-term memory (for recall during conversation).

Parameters

memoryManager?

ICognitiveMemoryManager

Optional cognitive memory manager for memory trace creation. When omitted, the bridge still indexes into RAG but creates no memory traces.

options?

MultimodalBridgeOptions

Bridge configuration overrides (mood, chunk sizes, etc.).

Returns

MultimodalMemoryBridge

A configured multimodal memory bridge instance.

Example

const bridge = indexer.createMemoryBridge(memoryManager, {
  enableMemory: true,
  defaultChunkSize: 800,
});

await bridge.ingestImage(imageBuffer, { source: 'user-upload' });

See MultimodalMemoryBridge for full documentation.


indexAudio()

indexAudio(opts): Promise<AudioIndexResult>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:391

Index an audio file by transcribing via STT, then embedding and storing the transcript.

Parameters

opts

AudioIndexOptions

Audio data, metadata, collection, and language options.

Returns

Promise<AudioIndexResult>

The document ID and generated transcript.

Throws

If no STT provider is configured.

Throws

If the STT provider fails to transcribe.

Throws

If embedding generation or vector store upsert fails.

Example

const result = await indexer.indexAudio({
  audio: fs.readFileSync('./podcast.mp3'),
  metadata: { source: 'podcast', episode: 42 },
  language: 'en',
});
console.log(result.transcript); // "Welcome to episode 42..."

indexImage()

indexImage(opts): Promise<ImageIndexResult>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:298

Index an image by generating a text description via vision LLM, then embedding and storing the description.

Parameters

opts

ImageIndexOptions

Image data, metadata, and collection options.

Returns

Promise<ImageIndexResult>

The document ID and generated description.

Throws

If no vision provider is configured.

Throws

If the vision LLM fails to describe the image.

Throws

If embedding generation or vector store upsert fails.

Example

const result = await indexer.indexImage({
  image: 'https://example.com/photo.jpg',
  metadata: { source: 'web-scrape', url: 'https://example.com' },
});
console.log(result.description); // "A golden retriever playing fetch..."

search()

search(query, opts?): Promise<MultimodalSearchResult[]>

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:479

Search across all modalities (text + image descriptions + audio transcripts).

The query text is embedded, then the vector store is searched with optional modality filtering. Results are returned with their source modality indicated.

Parameters

query

string

Natural language search query.

opts?

MultimodalSearchOptions

Optional search parameters (topK, modalities, collection).

Returns

Promise<MultimodalSearchResult[]>

Array of search results sorted by relevance score (descending).

Throws

If embedding generation fails.

Example

// Search only image descriptions
const imageResults = await indexer.search('cats playing', {
  modalities: ['image'],
  topK: 10,
});

// Search across all modalities
const allResults = await indexer.search('machine learning tutorial');

setHydeRetriever()

setHydeRetriever(retriever): void

Defined in: packages/agentos/src/rag/multimodal/MultimodalIndexer.ts:270

Attach a HyDE retriever to enable hypothesis-driven multimodal search.

Once set, pass hyde: { enabled: true } in the search() options to activate HyDE for that query. The retriever generates a hypothetical answer using an LLM, then embeds that answer instead of the raw query text, which typically yields better recall for exploratory queries.
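The HyDE idea can be sketched with stub providers; `hydeEmbed`, `stubLlm`, and `stubEmbed` are illustrative assumptions, not the HydeRetriever internals:

```typescript
// HyDE in miniature: draft a hypothetical answer with an LLM,
// then embed the draft instead of the raw query.
async function hydeEmbed(
  query: string,
  llm: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
): Promise<number[]> {
  const hypothesis = await llm(`Write a short passage that answers: ${query}`);
  return embed(hypothesis);
}

// Stub providers for demonstration.
const stubLlm = async (prompt: string) => `A hypothetical passage about ${prompt}.`;
const stubEmbed = async (text: string) => [text.length, 1]; // toy 2-dim embedding

const vector = await hydeEmbed('cats on a beach', stubLlm, stubEmbed);
// vector embeds the hypothetical passage, not the raw query text.
```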

Parameters

retriever

HydeRetriever

A pre-configured HydeRetriever instance.

Returns

void

Example

indexer.setHydeRetriever(new HydeRetriever({
  llmCaller: myLlmCaller,
  embeddingManager: myEmbeddingManager,
  config: { enabled: true },
}));

const results = await indexer.search('cats on a beach', {
  hyde: { enabled: true },
});