
Document Ingestion

The Memory ingestion pipeline converts external documents into searchable memory traces. It handles format detection, multi-tier PDF extraction, four chunking strategies, folder scanning with glob patterns, and optional vision-LLM image captioning.


Overview

The ingestion pipeline transforms files and URLs into chunked, searchable memory traces stored in the agent's brain.sqlite:

Source (file / directory / URL)
        │
        ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  LoaderRegistry  │────▶│  ChunkingEngine  │────▶│   SqliteBrain    │
│  (format detect) │     │  (split + index) │     │  (store traces)  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
        │                        │
        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐
│ MultimodalAggr.  │     │   FTS5 + Graph   │
│ (image captions) │     │  (search index)  │
└──────────────────┘     └──────────────────┘

Quick Start

import { Memory } from '@framers/agentos';

const mem = new Memory({ path: './brain.sqlite' });

// Single file
await mem.ingest('./report.pdf');

// Directory with glob filters
await mem.ingest('./docs', {
  recursive: true,
  include: ['**/*.md', '**/*.pdf'],
  exclude: ['**/node_modules/**'],
});

// URL
await mem.ingest('https://example.com/api-docs');

await mem.close();

Supported File Types

| Format | Extensions | Loader | Notes |
|---|---|---|---|
| PDF | `.pdf` | PdfLoader | 3-tier extraction (see below) |
| DOCX | `.docx` | DocxLoader | Also supports Docling for high fidelity |
| HTML | `.html`, `.htm` | HtmlLoader | Strips scripts/styles, extracts text |
| Markdown | `.md`, `.mdx` | MarkdownLoader | Preserves heading structure for hierarchical chunking |
| Plain text | `.txt` | TextLoader | Direct pass-through |
| CSV | `.csv` | CsvLoader | Each row becomes a trace or chunk |
| JSON | `.json` | JsonLoader | Extracts string values recursively |
| YAML | `.yaml`, `.yml` | YamlLoader | Converted to JSON, then extracted |
| URLs | `http://`, `https://` | UrlLoader | Fetches content, then routes to the appropriate loader |

The LoaderRegistry auto-detects the correct loader based on file extension. When Docling or OCR loaders are available in the environment, they automatically override the default handlers for PDF and DOCX.
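The extension-based dispatch with optional-loader override can be sketched as follows. This is a minimal illustration with hypothetical types, not the actual LoaderRegistry API:

```typescript
// Minimal sketch of extension-based loader dispatch (hypothetical shapes;
// the real LoaderRegistry API may differ).
type Loader = { name: string; load: (path: string) => Promise<string> };

class LoaderRegistry {
  private byExtension = new Map<string, Loader>();

  register(extensions: string[], loader: Loader): void {
    // Later registrations win, so optional loaders (Docling, OCR)
    // can override the default handlers for .pdf/.docx.
    for (const ext of extensions) this.byExtension.set(ext.toLowerCase(), loader);
  }

  resolve(path: string): Loader | undefined {
    const i = path.lastIndexOf('.');
    if (i < 0) return undefined; // no extension, no loader
    return this.byExtension.get(path.slice(i).toLowerCase());
  }
}
```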


3-Tier PDF Extraction

PDF extraction uses a cascading strategy that maximises text fidelity while remaining zero-dependency by default:

| Tier | Engine | Activation | Fidelity | Dependencies |
|---|---|---|---|---|
| Tier 1 | unpdf | Always (built-in) | Good for born-digital PDFs | None (pure JS) |
| Tier 2 | tesseract.js OCR | Auto when Tier 1 yields sparse text (< 50 chars/page average) | Handles scanned documents | `pnpm add tesseract.js` |
| Tier 3 | Docling sidecar | Opt-in via `doclingEnabled: true` | Highest fidelity (tables, layouts, figures) | `pip install docling` |

How the Cascade Works

PDF buffer arrives
        │
        ▼
[Tier 1: unpdf]
        │
        ├── Text extraction succeeds and is dense?
        │      └── YES → Return extracted text
        │
        └── Sparse text (< 50 chars/page)?
               │
               ▼
        [Tier 2: tesseract.js OCR] (if installed)
               └── Return OCR text

[Tier 3: Docling] (if doclingEnabled)
        └── Bypasses Tiers 1 and 2 entirely
            Runs `python3 -m docling --output-format json <file>`
            Returns high-fidelity structured extraction
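The cascade above can be sketched with injected extractor functions. The signatures are hypothetical; the real PdfLoader wires up unpdf, tesseract.js, and Docling internally:

```typescript
// Sketch of the 3-tier cascade with injected extractors (illustrative only).
interface PdfExtractors {
  unpdf: (buf: Uint8Array) => Promise<{ text: string; pages: number }>;
  ocr?: (buf: Uint8Array) => Promise<string>;     // present when tesseract.js is installed
  docling?: (buf: Uint8Array) => Promise<string>; // present when doclingEnabled is true
}

// Average chars/page below which Tier 1 output is considered sparse.
const SPARSE_THRESHOLD = 50;

async function extractPdf(buf: Uint8Array, ex: PdfExtractors): Promise<string> {
  if (ex.docling) return ex.docling(buf); // Tier 3 bypasses Tiers 1 and 2

  const { text, pages } = await ex.unpdf(buf); // Tier 1
  const dense = pages > 0 && text.length / pages >= SPARSE_THRESHOLD;
  if (dense || !ex.ocr) return text;

  return ex.ocr(buf); // Tier 2 fallback for scanned documents
}
```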

Configuration

const mem = new Memory({
  path: './brain.sqlite',
  ingestion: {
    extractImages: true,   // Pull images from PDFs/DOCX
    ocrEnabled: true,      // Allow tesseract.js fallback
    doclingEnabled: false, // Opt into Docling sidecar
  },
});

Chunking Strategies

The ChunkingEngine splits document text into indexable chunks. Four strategies are available:

| Strategy | Best For | Algorithm |
|---|---|---|
| `fixed` | General-purpose, predictable sizing | Split at character count with word-boundary awareness and configurable overlap |
| `semantic` | Topic-coherent chunks | Embed individual sentences, split where cosine similarity drops below threshold |
| `hierarchical` | Markdown documents with heading structure | Each heading creates a chunk boundary; long sections sub-split with `fixed` |
| `layout` | Code-heavy or table-heavy documents | Preserve fenced code blocks and pipe-delimited tables as atomic chunks |

Configuration

const mem = new Memory({
  path: './brain.sqlite',
  ingestion: {
    chunkStrategy: 'semantic', // 'fixed' | 'semantic' | 'hierarchical' | 'layout'
    chunkSize: 512,            // Target characters per chunk
    chunkOverlap: 64,          // Overlap between consecutive chunks
  },
});

Strategy Details

Fixed splits at a fixed character count, snapping to the nearest word boundary. The chunkOverlap parameter (default 64 chars) controls how much text is repeated between consecutive chunks to prevent context loss at split boundaries.
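A minimal sketch of fixed chunking with word-boundary snapping and overlap (illustrative only; the real ChunkingEngine may differ in details):

```typescript
// Fixed chunking: cut at chunkSize, snap back to a word boundary,
// then step back by `overlap` so consecutive chunks share context.
function chunkFixed(text: string, chunkSize = 512, overlap = 64): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      // Snap back to the nearest word boundary so words are not cut in half.
      const lastSpace = text.lastIndexOf(' ', end);
      if (lastSpace > start) end = lastSpace;
    }
    chunks.push(text.slice(start, end).trim());
    if (end >= text.length) break;
    // Overlap the next chunk with the tail of this one (always advancing).
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```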

Semantic requires an embedding function. It embeds individual sentences, then splits wherever cosine similarity between adjacent sentences drops below a threshold (topic boundary detection). Falls back to fixed when no embedFn is supplied.
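The boundary detection can be sketched like this, with an embedding function supplied by the caller. Sentence splitting and default threshold are assumptions; the real engine's choices may differ:

```typescript
// Semantic chunking sketch: embed each sentence, start a new chunk wherever
// adjacent-sentence cosine similarity falls below the threshold.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

async function chunkSemantic(
  sentences: string[],
  embedFn: (s: string) => Promise<number[]>,
  threshold = 0.5,
): Promise<string[]> {
  if (sentences.length === 0) return [];
  const vecs = await Promise.all(sentences.map(embedFn));
  const chunks: string[] = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(vecs[i - 1], vecs[i]) < threshold) {
      chunks.push(current.join(' ')); // similarity dropped: topic boundary
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(' '));
  return chunks;
}
```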

Hierarchical respects Markdown heading structure (#, ##, ###, etc.). Each heading creates a new chunk boundary, with the heading text stored in chunk metadata. Sections that exceed chunkSize are sub-split using the fixed strategy.

Layout detects fenced code blocks (```) and pipe-delimited tables (| col |) and preserves them as atomic chunks. Surrounding prose is split with fixed. This prevents code snippets and data tables from being cut mid-content.
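The atomic-block behaviour can be sketched as follows. This toy version only handles fenced code blocks and leaves prose segments intact (the real engine also treats pipe-delimited tables atomically and sub-splits the prose with `fixed`):

```typescript
// Layout-aware chunking sketch: fenced code blocks stay atomic.
// Built at runtime to avoid embedding a literal fence in this snippet.
const FENCE = '`'.repeat(3);

interface LayoutChunk { kind: 'prose' | 'code'; text: string }

function chunkLayout(text: string): LayoutChunk[] {
  const chunks: LayoutChunk[] = [];
  // Non-greedy match from an opening fence to the next closing fence.
  const fenced = new RegExp(`${FENCE}[\\s\\S]*?${FENCE}`, 'g');
  let last = 0;
  for (const m of text.matchAll(fenced)) {
    const idx = m.index ?? 0;
    const before = text.slice(last, idx).trim();
    if (before) chunks.push({ kind: 'prose', text: before });
    chunks.push({ kind: 'code', text: m[0] }); // atomic: never split mid-snippet
    last = idx + m[0].length;
  }
  const tail = text.slice(last).trim();
  if (tail) chunks.push({ kind: 'prose', text: tail });
  return chunks;
}
```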


FolderScanner

FolderScanner provides recursive directory ingestion with glob-based filtering via minimatch:

// Ingest an entire documentation folder
const result = await mem.ingest('./project/docs', {
  recursive: true,
  include: ['**/*.md', '**/*.pdf', '**/*.txt'],
  exclude: ['**/node_modules/**', '**/.git/**', '**/dist/**'],
  onProgress: (processed, total, current) => {
    console.log(`[${processed}/${total}] ${current}`);
  },
});

console.log(`Succeeded: ${result.succeeded.length}`);
console.log(`Failed: ${result.failed.length}`);
console.log(`Chunks created: ${result.chunksCreated}`);
console.log(`Traces created: ${result.tracesCreated}`);

Behaviour

  • When recursive is false (default), only direct children of the directory are processed.
  • include patterns are evaluated first; only matching files are considered.
  • exclude patterns are evaluated second; matching files are skipped.
  • Patterns are matched against the path relative to the scanned root directory.
  • A single unreadable or unparseable file never aborts the entire scan; errors are collected in result.failed.
  • The onProgress callback fires after each file attempt (success or failure).
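The include-then-exclude ordering can be sketched like this. The real FolderScanner delegates pattern matching to minimatch; the toy glob here supports only `**` and `*` for illustration:

```typescript
// Toy glob-to-regex supporting `**/` (any directory depth) and `*`
// (any chars except `/`). Real matching is done by minimatch.
function globToRegExp(glob: string): RegExp {
  const re = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*\*\//g, '\u0000')          // placeholders so passes don't collide
    .replace(/\*\*/g, '\u0001')
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '(?:.*/)?')
    .replace(/\u0001/g, '.*');
  return new RegExp(`^${re}$`);
}

function shouldIngest(relPath: string, include?: string[], exclude?: string[]): boolean {
  // include first: when given, only matching files are considered...
  if (include && !include.some(g => globToRegExp(g).test(relPath))) return false;
  // ...then exclude: matching files are skipped.
  if (exclude && exclude.some(g => globToRegExp(g).test(relPath))) return false;
  return true;
}
```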

IngestResult

interface IngestResult {
  succeeded: string[];                            // Absolute paths of ingested files
  failed: Array<{ path: string; error: string }>; // Files that could not be processed
  chunksCreated: number;                          // Total chunks stored
  tracesCreated: number;                          // Total memory traces created
}

MultimodalAggregator

When extractImages: true is configured, document loaders (PDF, DOCX) extract embedded images as ExtractedImage objects. The MultimodalAggregator enriches them with natural-language captions via a vision-capable LLM:

const mem = new Memory({
  path: './brain.sqlite',
  ingestion: {
    extractImages: true,
    visionLlm: 'gpt-4o', // Model used for image captioning
  },
});

await mem.ingest('./slides.pdf');
// Images are extracted, captioned, and stored in document_images table

How It Works

  1. Document loaders produce ExtractedImage objects (raw bytes + MIME type + optional page number).
  2. MultimodalAggregator receives the image batch and calls the describeImage function for each image lacking a caption.
  3. Images are processed in parallel via Promise.allSettled; a single failed captioning attempt does not block the rest.
  4. Failed images retain their un-captioned state rather than propagating errors.
  5. Captions are stored in the document_images.caption column and indexed for text retrieval.
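The fault-tolerant parallel step can be sketched as follows, with hypothetical shapes for `ExtractedImage` and `describeImage` (the real MultimodalAggregator API may differ):

```typescript
// Parallel caption enrichment with Promise.allSettled: a rejected
// captioning call leaves that image un-captioned instead of throwing.
interface ExtractedImage {
  bytes: Uint8Array;
  mimeType: string;
  page?: number;
  caption?: string;
}

async function captionImages(
  images: ExtractedImage[],
  describeImage: (img: ExtractedImage) => Promise<string>,
): Promise<ExtractedImage[]> {
  const results = await Promise.allSettled(
    // Images that already carry a caption are passed through unchanged.
    images.map(img => (img.caption ? Promise.resolve(img.caption) : describeImage(img))),
  );
  return images.map((img, i) => {
    const r = results[i];
    return r.status === 'fulfilled' ? { ...img, caption: r.value } : img;
  });
}
```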

Passthrough Mode

When no describeImage function is configured, the aggregator passes images through unchanged. This is the default behaviour when visionLlm is not set.


URL Ingestion

The UrlLoader fetches content from HTTP/HTTPS URLs and routes it through the appropriate document loader:

// Single URL
await mem.ingest('https://docs.example.com/guide');

// The UrlLoader:
// 1. Fetches the URL via HTTP GET
// 2. Detects content type from response headers
// 3. Routes to HtmlLoader, MarkdownLoader, etc.
// 4. Chunks and stores as memory traces

Idempotent Re-Ingestion

Every ingested document is tracked in the documents table with a SHA-256 content_hash. When the same file is ingested again:

  • If the content hash matches the previously ingested version, the file is skipped.
  • If the content has changed, the old chunks are replaced with the new extraction.

This makes it safe to re-run ingestion on the same directory without creating duplicates.
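The skip/replace decision can be sketched with an in-memory store standing in for the `documents` table (which in reality lives in SQLite):

```typescript
// Content-hash check behind idempotent re-ingestion (hypothetical store shape).
import { createHash } from 'node:crypto';

const seen = new Map<string, string>(); // path -> content_hash

type IngestAction = 'created' | 'skipped' | 'replaced';

function planIngest(path: string, content: string | Buffer): IngestAction {
  const hash = createHash('sha256').update(content).digest('hex');
  const prev = seen.get(path);
  seen.set(path, hash);
  if (prev === hash) return 'skipped';       // unchanged: nothing to do
  if (prev !== undefined) return 'replaced'; // changed: old chunks replaced
  return 'created';                          // first ingestion: new traces stored
}
```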


Configuration Reference

All ingestion options can be set at the Memory constructor level (applied to every ingest() call) or per-call:

| Option | Default | Description |
|---|---|---|
| `chunkStrategy` | `'semantic'` | Chunking algorithm: `fixed`, `semantic`, `hierarchical`, `layout` |
| `chunkSize` | `512` | Target character count per chunk |
| `chunkOverlap` | `64` | Character overlap between consecutive chunks |
| `extractImages` | `false` | Extract embedded images from PDF/DOCX |
| `ocrEnabled` | `false` | Allow tesseract.js fallback for sparse PDFs |
| `doclingEnabled` | `false` | Use Docling sidecar for high-fidelity extraction |
| `visionLlm` | `undefined` | Vision model for image captioning |
| `recursive` | `false` | Descend into subdirectories (per-call) |
| `include` | `undefined` | Glob patterns to include (per-call) |
| `exclude` | `undefined` | Glob patterns to exclude (per-call) |

Source Files

| File | Purpose |
|---|---|
| `memory/ingestion/LoaderRegistry.ts` | Auto-detection and loader dispatch |
| `memory/ingestion/PdfLoader.ts` | 3-tier PDF extraction (unpdf + OCR + Docling) |
| `memory/ingestion/OcrPdfLoader.ts` | tesseract.js OCR fallback |
| `memory/ingestion/DoclingLoader.ts` | Python Docling sidecar |
| `memory/ingestion/FolderScanner.ts` | Recursive directory walking |
| `memory/ingestion/ChunkingEngine.ts` | 4-strategy chunking |
| `memory/ingestion/MultimodalAggregator.ts` | Image caption enrichment |
| `memory/ingestion/UrlLoader.ts` | HTTP/HTTPS URL fetching |
| `memory/facade/types.ts` | IngestOptions, IngestResult, IngestionConfig |