Multimodal RAG (Image + Audio)
Memory benchmarks (full N=500, gpt-4o reader): 85.6% on LongMemEval-S at $0.0090 per correct answer, 1.4 points above Mastra Observational Memory (84.23%); 70.2% on LongMemEval-M (the 1.5M-token / 500-session haystack variant), making this the only open-source library on the public record above 65% on M with publicly reproducible methodology. The same text-first retrieval pipeline that produced these numbers is what the multimodal pattern below indexes against (derived captions, transcripts, OCR, document text) once you have a text representation. Benchmarks · Run JSONs · SOTA writeup
AgentOS’ core RAG APIs are text-first (EmbeddingManager + VectorStoreManager + RetrievalAugmentor). Multimodal support (image/audio) is implemented as a composable pattern on top:
- Store the binary asset (optional) + metadata.
- Derive a text representation (caption/transcript/OCR/document text extraction).
- Index that text as a normal RAG document so the existing retrieval pipeline (vector, BM25, reranking, GraphRAG, etc.) can operate without any “special” multimodal database.
- Optionally add modality-specific embeddings (image-to-image / audio-to-audio) as a fast path.
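The four steps above can be sketched as plain data flow. This is an illustrative sketch, not the AgentOS API surface: `MediaAsset`, `RagDocument`, and `toRagDocument` are hypothetical names introduced here.

```typescript
// Hypothetical shapes for the ingest pattern described above.
interface MediaAsset {
  assetId: string;
  modality: 'image' | 'audio' | 'document';
  collectionId: string;
  textRepresentation: string; // caption / transcript / OCR / extracted text
}

interface RagDocument {
  documentId: string;
  collectionId: string;
  text: string;
}

// Step 3 of the pattern: the derived text becomes a normal RAG document,
// keyed by the asset id so retrieval hits map back to the stored asset.
function toRagDocument(asset: MediaAsset): RagDocument {
  return {
    documentId: asset.assetId,
    collectionId: asset.collectionId,
    text: asset.textRepresentation,
  };
}
```

Because the output is an ordinary text document, every existing retrieval feature (vector, BM25, reranking, GraphRAG) applies to it unchanged.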
This guide documents the reference implementation used by the AgentOS HTTP API router (@framers/agentos-ext-http-api) and the voice-chat-assistant backend.
This is a strong production baseline, not a claim that AgentOS already ships the full current frontier of multimodal retrieval research. Today the canonical retrieval surface is still derived text. Direct visual late-interaction retrievers and page-native document retrieval remain follow-up work.
Current implementation detail: PDF/document ingestion now indexes extracted text into standard RAG collections through MultimodalIndexer.indexText(...), so derived document text is retrievable through the normal text pipeline rather than only being stored as memory traces.
Why This Design
- Works by default: If you can derive text, you can retrieve multimodal assets immediately using the standard RAG pipeline.
- Optional offline: Image/audio embedding retrieval is install-on-demand and can be enabled per deployment.
- No vendor lock-in: The same abstractions work with `SqlVectorStore`, `HnswlibVectorStore`, or `QdrantVectorStore`.
Architecture
Ingest (image/audio/document)
Key idea: the derived text is the canonical retrieval surface. Modality embeddings (when enabled) are an acceleration path, not a requirement. Documents are first-class assets in the same model, but stay text-first for now.
That text-first design has one important boundary today: UnifiedRetriever still treats its multimodal source as non-text-only. Document/PDF text retrieval therefore works through the standard text RAG collections rather than through the multimodal source branch in UnifiedRetriever.
Query
Query-by-image and query-by-audio support a unified retrievalMode contract:
- `auto` (default): text-first retrieval, with native modality retrieval added when available
- `text`: derived text only
- `native`: modality-native embeddings only
- `hybrid`: fuse text + native retrieval when both are available
Text queries over `/multimodal/query` can search any combination of image, audio, and document assets.
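The `retrievalMode` contract can be illustrated with a small planner sketch. The `planPaths` helper is hypothetical, not the actual AgentOS planner:

```typescript
type RetrievalMode = 'auto' | 'text' | 'native' | 'hybrid';

// Decide which retrieval paths run for a query, given whether native
// modality embeddings are available in this deployment.
function planPaths(mode: RetrievalMode, nativeAvailable: boolean): string[] {
  switch (mode) {
    case 'text':
      return ['text'];
    case 'native':
      return ['native'];
    case 'hybrid':
      // fuse text + native retrieval when both are available
      return nativeAvailable ? ['text', 'native'] : ['text'];
    case 'auto':
    default:
      // text-first; native retrieval is added opportunistically
      return nativeAvailable ? ['text', 'native'] : ['text'];
  }
}
```

Note that `auto` and `hybrid` only differ in intent here; in a fuller implementation `hybrid` would also control how the two result lists are fused.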
Data Model (Reference Backend)
The backend stores multimodal asset metadata in a dedicated SQL table (name depends on the configured RAG table prefix):
- `media_assets.asset_id` (string) is the stable identifier.
- `media_assets.modality` is `image`, `audio`, or `document`.
- `media_assets.collection_id` is the “base” collection that the derived text is indexed into.
- `store_payload` controls whether raw bytes are persisted.
- `metadata_json`, `tags_json`, `source_url`, `mime_type`, `original_file_name` are stored for filtering/display.
The derived text representation is indexed as a normal RAG document:
- `documentId = assetId`
- `collectionId` is `media_images`, `media_audio`, or `media_documents` by default (configurable)
- chunks are generated from `textRepresentation` (usually 1 chunk unless you provide long text)
When offline embeddings are enabled, the reference backend also writes into embedding collections derived from the base collection:
- image embeddings: `${baseCollectionId}${suffix}` (default suffix `_img`)
- audio embeddings: `${baseCollectionId}${suffix}` (default suffix `_aud`)
This keeps modality embeddings separate from text embeddings, while still reusing the same vector-store provider.
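The naming convention is simple enough to state as a one-line helper (the function name is illustrative, not part of the AgentOS API):

```typescript
// Derive the embedding collection id from the base text collection.
// Follows the `${baseCollectionId}${suffix}` convention described above.
function embeddingCollectionId(baseCollectionId: string, suffix: string): string {
  return `${baseCollectionId}${suffix}`;
}

// e.g. the default image-embedding collection for `media_images`:
// embeddingCollectionId('media_images', '_img') === 'media_images_img'
```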
HTTP API Surface
The host-agnostic Express router lives in @framers/agentos-ext-http-api:
```ts
import express from 'express';
import { createAgentOSRagRouter } from '@framers/agentos-ext-http-api';

const app = express();

app.use(
  '/api/agentos/rag',
  createAgentOSRagRouter({
    isEnabled: () => true,
    ragService, // host-provided implementation
  }),
);
```
It mounts multimodal routes under /multimodal/*:
- `POST /multimodal/images/ingest` (multipart field: `image`)
- `POST /multimodal/audio/ingest` (multipart field: `audio`)
- `POST /multimodal/documents/ingest` (multipart field: `document`)
- `POST /multimodal/query` (search derived text)
- `POST /multimodal/images/query` (query-by-image)
- `POST /multimodal/audio/query` (query-by-audio)
- `GET /multimodal/assets/:assetId`
- `GET /multimodal/assets/:assetId/content` (only if payload is stored)
- `DELETE /multimodal/assets/:assetId`
See BACKEND_API.md for request/response examples and deployment notes.
Offline Embeddings (Optional)
Offline embeddings are disabled by default and are install-on-demand:
- Image embeddings: require Transformers.js (`@huggingface/transformers` preferred; `@xenova/transformers` supported).
- Audio embeddings: require Transformers.js and WAV decoding support via `wavefile` (Node-only in the reference backend).
When offline embeddings are not enabled (or deps are missing), the system falls back to:
- query-by-image: caption the query image, then run text retrieval
- query-by-audio: transcribe the query audio, then run text retrieval
- document ingest: parse PDF/DOCX/TXT/MD/CSV/JSON/XML into derived text, then run normal text retrieval
Both query endpoints accept:
- `textRepresentation` to bypass captioning/transcription
- `retrievalMode=auto|text|native|hybrid` to control the planner
`auto` keeps derived text as the canonical retrieval layer and only adds native retrieval opportunistically.
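For illustration, a query-by-audio request that supplies its own transcript and forces hybrid retrieval might send a body like the following. The field values, and the `topK` field, are assumptions for the example, not a documented schema:

```json
{
  "textRepresentation": "speaker walks through the quarterly revenue targets",
  "retrievalMode": "hybrid",
  "collectionIds": ["media_audio"],
  "topK": 5
}
```

Because `textRepresentation` is present, the backend can skip transcription entirely and go straight to retrieval.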
Additional compatibility notes:
- Multipart query fields such as `modalities` and `collectionIds` may be sent as comma-separated strings (`image,audio,docs,media_images`) by higher-level clients.
- Document assets can be searched through `/multimodal/query` with `modalities: ["document"]` or mixed alongside image/audio assets.
- Document parsing in the reference backend currently supports PDF, DOCX, TXT, Markdown, CSV, JSON, and XML.
- PDFs that contain no embedded text still need a page-image OCR/vision pipeline; the current backend surfaces that as an explicit extraction error instead of silently indexing nothing.
- Ollama can be used for image captioning when the selected model supports vision input and the caller sends image bytes as an inline `data:` URL. Remote image URLs are not converted automatically for Ollama in the current provider adapter.
- Audio embedding retrieval is still WAV-only in the Node reference backend. Non-WAV audio still works via transcript-first retrieval.
Configuration (Reference Backend)
These env vars control the multimodal behavior in the voice-chat-assistant backend:
- `AGENTOS_RAG_MEDIA_STORE_PAYLOAD=true|false` (default `false`)
- `AGENTOS_RAG_MEDIA_IMAGE_COLLECTION_ID` (default `media_images`)
- `AGENTOS_RAG_MEDIA_AUDIO_COLLECTION_ID` (default `media_audio`)
- `AGENTOS_RAG_MEDIA_DOCUMENT_COLLECTION_ID` (default `media_documents`)
- `AGENTOS_RAG_MEDIA_IMAGE_EMBEDDINGS_ENABLED=true|false` (default `false`)
- `AGENTOS_RAG_MEDIA_IMAGE_EMBED_MODEL` (default `Xenova/clip-vit-base-patch32`)
- `AGENTOS_RAG_MEDIA_IMAGE_EMBED_COLLECTION_SUFFIX` (default `_img`)
- `AGENTOS_RAG_MEDIA_AUDIO_EMBEDDINGS_ENABLED=true|false` (default `false`)
- `AGENTOS_RAG_MEDIA_AUDIO_EMBED_MODEL` (default `Xenova/clap-htsat-unfused`)
- `AGENTOS_RAG_MEDIA_AUDIO_EMBED_COLLECTION_SUFFIX` (default `_aud`)
- `AGENTOS_RAG_MEDIA_*_EMBED_CACHE_DIR` (optional; recommended for servers to persist model downloads)
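As a concrete illustration, a deployment that enables CLIP image embeddings with a persistent model cache, while keeping raw asset bytes out of the database, might set (the cache path is an example):

```shell
# Enable native image embeddings (install-on-demand Transformers.js path)
AGENTOS_RAG_MEDIA_IMAGE_EMBEDDINGS_ENABLED=true
AGENTOS_RAG_MEDIA_IMAGE_EMBED_MODEL=Xenova/clip-vit-base-patch32
# Persist model downloads across restarts
AGENTOS_RAG_MEDIA_IMAGE_EMBED_CACHE_DIR=/var/cache/agentos/models
# Store metadata only; do not persist raw asset bytes
AGENTOS_RAG_MEDIA_STORE_PAYLOAD=false
```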
Extending To Video
The recommended approach is the same pattern:
- Persist video metadata and optional bytes.
- Derive one or more text representations (e.g. transcript, scene captions, frame OCR).
- Index derived text into a `media_videos` collection.
- (Optional) add a video embedding collection for query-by-video.
This keeps the base retrieval system consistent while still allowing richer modality-specific paths.
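One design point worth making explicit: a single video asset can yield several derived text representations, each indexed as its own document so retrieval hits stay attributable to the source asset. A hypothetical sketch (the names are illustrative, not AgentOS API):

```typescript
interface VideoDerivedText {
  kind: 'transcript' | 'scene_caption' | 'frame_ocr';
  text: string;
}

// One video asset fans out into several indexable documents, all in the
// same `media_videos` collection and all keyed back to the asset id.
function toVideoDocuments(assetId: string, derived: VideoDerivedText[]) {
  return derived.map((d, i) => ({
    documentId: `${assetId}:${d.kind}:${i}`,
    collectionId: 'media_videos',
    text: d.text,
  }));
}
```

Keeping the asset id in each `documentId` means a hit on any representation (transcript, caption, OCR) resolves to the same underlying video.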
References
Retrieval-augmented generation foundations
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020. — Original RAG paper. arXiv:2005.11401
- Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W.-t. (2020). Dense passage retrieval for open-domain question answering. EMNLP 2020. — Bi-encoder dense retrieval (the cosine-similarity layer in this pipeline). arXiv:2004.04906
Hybrid retrieval (dense + sparse + reranker)
- Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. — BM25 reference (the sparse arm of hybrid retrieval). DOI
- Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR 2009. — RRF for fusing dense + sparse rankings. ACM DL
- Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint. — Cross-encoder reranking principle behind the Cohere / Transformers.js rerank stage. arXiv:1901.04085
Hypothetical document expansion
- Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise zero-shot dense retrieval without relevance labels. arXiv preprint. — HyDE retrieval. arXiv:2212.10496
Graph-augmented retrieval
- Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv preprint. — Microsoft GraphRAG; community detection + summarization for multi-hop reasoning. arXiv:2404.16130
- Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008. — Louvain algorithm used by `GraphRAGEngine` for community detection. arXiv:0803.0476
Multimodal embeddings
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. ICML 2021. — CLIP, the foundation for image-text joint embeddings used in vision retrieval. arXiv:2103.00020
- Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., & Dubnov, S. (2023). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. ICASSP 2023. — CLAP audio-text embeddings, referenced as `Xenova/clap-htsat-unfused` for the audio retrieval path. arXiv:2211.06687
Vector indexing
- Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824–836. — HNSW algorithm behind the `HnswlibVectorStore` backend. arXiv:1603.09320
Implementation references
- `packages/agentos/src/rag/` — vector stores, embeddings, fusion, reranking, GraphRAG
- `packages/agentos/src/memory/retrieval/hyde/MemoryHydeRetriever.ts` — HyDE for memory-specific recall
- `packages/agentos/src/memory/retrieval/graph/graphrag/GraphRAGEngine.ts` — Microsoft GraphRAG-style implementation