Multimodal RAG (Image + Audio)
AgentOS’ core RAG APIs are text-first (EmbeddingManager + VectorStoreManager + RetrievalAugmentor). Multimodal support (image/audio) is implemented as a composable pattern on top:
- Store the binary asset (optional) + metadata.
- Derive a text representation (caption/transcript/OCR/document text extraction).
- Index that text as a normal RAG document so the existing retrieval pipeline (vector, BM25, reranking, GraphRAG, etc.) can operate without any “special” multimodal database.
- Optionally add modality-specific embeddings (image-to-image / audio-to-audio) as a fast path.
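The four-step pattern above can be sketched as a small ingest function. This is a hypothetical outline, not the actual AgentOS API; the `IngestDeps` interface and all function names are illustrative stand-ins for the real services:

```typescript
// Hypothetical sketch of the four-step multimodal ingest pattern.
type Modality = 'image' | 'audio' | 'document';

interface IngestDeps {
  // Persist binary asset + metadata; returns a stable assetId. (Illustrative.)
  storeAsset(bytes: Uint8Array, meta: { modality: Modality }): Promise<string>;
  // Caption / transcript / OCR / document text extraction. (Illustrative.)
  deriveText(bytes: Uint8Array, modality: Modality): Promise<string>;
  // Index derived text as a normal RAG document. (Illustrative.)
  indexText(documentId: string, text: string): Promise<void>;
  // Optional modality-native embedding fast path. (Illustrative.)
  embedNative?(bytes: Uint8Array, modality: Modality): Promise<void>;
}

async function ingestAsset(
  deps: IngestDeps,
  bytes: Uint8Array,
  modality: Modality,
): Promise<string> {
  const assetId = await deps.storeAsset(bytes, { modality }); // 1. store asset + metadata
  const text = await deps.deriveText(bytes, modality);        // 2. derive text representation
  await deps.indexText(assetId, text);                        // 3. index as a normal RAG document
  await deps.embedNative?.(bytes, modality);                  // 4. optional modality embeddings
  return assetId;
}
```

The key property is that step 3 hands off to the unchanged text-RAG pipeline; steps 1 and 4 are optional around it.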
This guide documents the reference implementation used by the AgentOS HTTP API router (@framers/agentos-ext-http-api) and the voice-chat-assistant backend.
This is a strong production baseline, not an implementation of the current frontier of multimodal retrieval research. Today the canonical retrieval surface is still derived text; direct visual late-interaction retrievers and page-native document retrieval remain follow-up work.
Why This Design
- Works by default: If you can derive text, you can retrieve multimodal assets immediately using the standard RAG pipeline.
- Optional offline: Image/audio embedding retrieval is install-on-demand and can be enabled per deployment.
- No vendor lock-in: The same abstractions work with `SqlVectorStore`, `HnswlibVectorStore`, or `QdrantVectorStore`.
Architecture
Ingest (image/audio/document)
Key idea: the derived text is the canonical retrieval surface. Modality embeddings (when enabled) are an acceleration path, not a requirement. Documents are first-class assets in the same model, but stay text-first for now.
Query
Query-by-image and query-by-audio support a unified retrievalMode contract:
- `auto` (default): text-first retrieval, with native modality retrieval added when available
- `text`: derived text only
- `native`: modality-native embeddings only
- `hybrid`: fuse text + native retrieval when both are available
Text queries over /multimodal/query can search any combination of image, audio, and document assets.
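The `retrievalMode` contract can be sketched as a small planner. This is an illustrative sketch, not the backend's actual planner code; `planRetrieval` and `RetrievalPlan` are assumed names:

```typescript
// Hypothetical planner for the retrievalMode contract. "Native" availability
// depends on whether offline modality embeddings are installed and enabled.
type RetrievalMode = 'auto' | 'text' | 'native' | 'hybrid';

interface RetrievalPlan {
  useText: boolean;   // run retrieval over the derived-text documents
  useNative: boolean; // run retrieval over modality-native embeddings
}

function planRetrieval(mode: RetrievalMode, nativeAvailable: boolean): RetrievalPlan {
  switch (mode) {
    case 'text':
      return { useText: true, useNative: false };
    case 'native':
      // Native-only; an unavailable native path would surface as an error upstream.
      return { useText: false, useNative: nativeAvailable };
    case 'hybrid':
      // Fuse text + native results when both paths exist.
      return { useText: true, useNative: nativeAvailable };
    case 'auto':
    default:
      // Text-first, with native retrieval added opportunistically.
      return { useText: true, useNative: nativeAvailable };
  }
}
```

Note that `auto` and `hybrid` produce the same plan here; in practice they can differ in how results are fused and ranked.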
Data Model (Reference Backend)
The backend stores multimodal asset metadata in a dedicated SQL table (name depends on the configured RAG table prefix):
- `media_assets.asset_id` (string) is the stable identifier.
- `media_assets.modality` is `image`, `audio`, or `document`.
- `media_assets.collection_id` is the “base” collection that the derived text is indexed into.
- `store_payload` controls whether raw bytes are persisted.
- `metadata_json`, `tags_json`, `source_url`, `mime_type`, `original_file_name` are stored for filtering/display.
The derived text representation is indexed as a normal RAG document:
- `documentId = assetId`
- `collectionId` = `media_images`, `media_audio`, or `media_documents` by default (configurable)
- chunks are generated from `textRepresentation` (usually 1 chunk unless you provide long text)
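The asset-to-document mapping can be sketched as follows. Types and helper names are hypothetical; the field names follow this doc, but the concrete backend types may differ:

```typescript
// Hypothetical sketch of the asset -> RAG document mapping.
interface MediaAsset {
  assetId: string;
  modality: 'image' | 'audio' | 'document';
  textRepresentation: string; // caption / transcript / extracted text
  metadata?: Record<string, string>;
}

interface RagDocument {
  documentId: string;
  collectionId: string;
  text: string;
  metadata: Record<string, string>;
}

// Default base collections per modality, as described above (configurable).
const DEFAULT_COLLECTIONS: Record<MediaAsset['modality'], string> = {
  image: 'media_images',
  audio: 'media_audio',
  document: 'media_documents',
};

function toRagDocument(asset: MediaAsset, collectionId?: string): RagDocument {
  return {
    documentId: asset.assetId, // stable 1:1 mapping: documentId = assetId
    collectionId: collectionId ?? DEFAULT_COLLECTIONS[asset.modality],
    text: asset.textRepresentation, // the canonical retrieval surface
    metadata: { modality: asset.modality, ...asset.metadata },
  };
}
```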
When offline embeddings are enabled, the reference backend also writes into embedding collections derived from the base collection:
- image embeddings: `${baseCollectionId}${suffix}` (default suffix `_img`)
- audio embeddings: `${baseCollectionId}${suffix}` (default suffix `_aud`)
This keeps modality embeddings separate from text embeddings, while still reusing the same vector-store provider.
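A minimal sketch of the naming convention, assuming the default suffixes above (`embeddingCollectionId` is an illustrative helper name, not the backend's actual function):

```typescript
// Derive the embedding collection id from a base collection, per the
// convention above. Suffixes are configurable; these are the defaults.
function embeddingCollectionId(
  baseCollectionId: string,
  modality: 'image' | 'audio',
  suffix?: string,
): string {
  const defaultSuffix = modality === 'image' ? '_img' : '_aud';
  return `${baseCollectionId}${suffix ?? defaultSuffix}`;
}
```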
HTTP API Surface
The host-agnostic Express router lives in @framers/agentos-ext-http-api:
```ts
import express from 'express';
import { createAgentOSRagRouter } from '@framers/agentos-ext-http-api';

const app = express();

app.use(
  '/api/agentos/rag',
  createAgentOSRagRouter({
    isEnabled: () => true,
    ragService, // host-provided implementation
  }),
);
```
It mounts multimodal routes under /multimodal/*:
- `POST /multimodal/images/ingest` (multipart field: `image`)
- `POST /multimodal/audio/ingest` (multipart field: `audio`)
- `POST /multimodal/documents/ingest` (multipart field: `document`)
- `POST /multimodal/query` (search derived text)
- `POST /multimodal/images/query` (query-by-image)
- `POST /multimodal/audio/query` (query-by-audio)
- `GET /multimodal/assets/:assetId`
- `GET /multimodal/assets/:assetId/content` (only if payload is stored)
- `DELETE /multimodal/assets/:assetId`
See BACKEND_API.md for request/response examples and deployment notes.
Offline Embeddings (Optional)
Offline embeddings are disabled by default and are install-on-demand:
- Image embeddings: require Transformers.js (`@huggingface/transformers` preferred; `@xenova/transformers` supported).
- Audio embeddings: require Transformers.js and WAV decoding support via `wavefile` (Node-only in the reference backend).
When offline embeddings are not enabled (or deps are missing), the system falls back to:
- query-by-image: caption the query image, then run text retrieval
- query-by-audio: transcribe the query audio, then run text retrieval
- document ingest: parse PDF/DOCX/TXT/MD/CSV/JSON/XML into derived text, then run normal text retrieval
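The fallback logic for query-by-image can be sketched as below; the same shape applies to query-by-audio with a transcriber in place of the captioner. All helper names here are hypothetical:

```typescript
// Hypothetical sketch of the query-by-image fallback path.
interface QueryDeps {
  captionImage(bytes: Uint8Array): Promise<string>;   // vision captioner
  textSearch(query: string): Promise<string[]>;       // normal text retrieval, returns asset ids
  nativeSearch?(bytes: Uint8Array): Promise<string[]>; // only present when offline embeddings are enabled
}

async function queryByImage(bytes: Uint8Array, deps: QueryDeps): Promise<string[]> {
  if (deps.nativeSearch) {
    // Native embedding path available: image-to-image retrieval.
    return deps.nativeSearch(bytes);
  }
  // Fallback: caption the query image, then run normal text retrieval.
  const caption = await deps.captionImage(bytes);
  return deps.textSearch(caption);
}
```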
Both query endpoints accept:
- `textRepresentation` to bypass captioning/transcription
- `retrievalMode=auto|text|native|hybrid` to control the planner

`auto` keeps derived text as the canonical retrieval layer and only adds native retrieval opportunistically.
Additional compatibility notes:
- Multipart query fields such as `modalities` and `collectionIds` may be sent as comma-separated strings (`image,audio,docs,media_images`) by higher-level clients such as Wunderland.
- Document assets can be searched through `/multimodal/query` with `modalities: ["document"]` or mixed alongside image/audio assets.
- Document parsing in the reference backend currently supports PDF, DOCX, TXT, Markdown, CSV, JSON, and XML.
- PDFs that contain no embedded text still need a page-image OCR/vision pipeline; the current backend surfaces that as an explicit extraction error instead of silently indexing nothing.
- Ollama can be used for image captioning when the selected model supports vision input and the caller sends image bytes as an inline `data:` URL. Remote image URLs are not converted automatically for Ollama in the current provider adapter.
- Audio embedding retrieval is still WAV-only in the Node reference backend. Non-WAV audio still works via transcript-first retrieval.
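Accepting both array and comma-separated string forms for fields like `modalities` can be handled with a small normalizer. This is an illustrative sketch, not the router's actual parsing code:

```typescript
// Normalize a multipart field that may arrive as an array or as a
// comma-separated string (e.g. "image,audio" from higher-level clients).
function parseListField(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  const raw = Array.isArray(value) ? value : value.split(',');
  return raw.map((v) => v.trim()).filter((v) => v.length > 0);
}
```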