
Class: PdfLoader

Defined in: packages/agentos/src/memory/ingestion/PdfLoader.ts:93

Document loader for PDF files.

Extraction tiers

  1. unpdf: the primary extraction engine, used whenever no DoclingLoader is provided. Performs pure-JS extraction of the PDF text layer, with no native binaries required.
  2. OcrPdfLoader (optional): supplied at construction time and engaged automatically when unpdf yields sparse text (fewer than 50 characters per page on average), which indicates a scanned document.
  3. DoclingLoader (optional): when provided, takes precedence over both unpdf and OCR, yielding the highest-fidelity extraction at the cost of requiring a Python runtime.
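
The tier order above can be sketched as a small decision function. This is a minimal illustration and not the PdfLoader source: the names `chooseTier` and `averageCharsPerPage` are hypothetical, and trimming whitespace before counting is an assumption. Only the threshold of 50 characters per page and the precedence of the tiers come from the description above.

```typescript
// Hypothetical helpers; only the threshold and the tier order are taken
// from the documentation above.
const SPARSE_CHARS_PER_PAGE = 50;

type Tier = "docling" | "ocr" | "unpdf";

/** Average characters per page, counted after trimming surrounding whitespace (assumption). */
function averageCharsPerPage(pageTexts: string[]): number {
  if (pageTexts.length === 0) return 0;
  const total = pageTexts.reduce((sum, text) => sum + text.trim().length, 0);
  return total / pageTexts.length;
}

/** Picks the extraction tier for the text that the unpdf pass produced. */
function chooseTier(pageTexts: string[], hasOcr: boolean, hasDocling: boolean): Tier {
  // Docling, when available, wins over both other tiers.
  if (hasDocling) return "docling";
  // Sparse text suggests a scanned document, so fall back to OCR if possible.
  if (hasOcr && averageCharsPerPage(pageTexts) < SPARSE_CHARS_PER_PAGE) return "ocr";
  return "unpdf";
}
```

Without an OCR loader, a sparse document still resolves to `"unpdf"` in this sketch, since there is no fallback to engage.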

Implements

IDocumentLoader

Example

const ocrLoader    = createOcrPdfLoader();   // null if tesseract.js absent
const doclingLoader = createDoclingLoader(); // null if docling absent
const pdfLoader = new PdfLoader(ocrLoader, doclingLoader);
const doc = await pdfLoader.load('/reports/q3.pdf');

Constructors

Constructor

new PdfLoader(ocrLoader?, doclingLoader?): PdfLoader

Defined in: packages/agentos/src/memory/ingestion/PdfLoader.ts:116

Creates a new PdfLoader.

Parameters

ocrLoader?

Optional OCR fallback (e.g. from createOcrPdfLoader).

IDocumentLoader | null

doclingLoader?

Optional Docling loader (e.g. from createDoclingLoader).

IDocumentLoader | null

Returns

PdfLoader

Properties

supportedExtensions

readonly supportedExtensions: string[]

Defined in: packages/agentos/src/memory/ingestion/PdfLoader.ts:95

File extensions this loader handles, each with a leading dot.

Used by LoaderRegistry to route file paths to the correct loader.

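Example (inherited from the IDocumentLoader interface documentation, showing a Markdown loader; for PdfLoader the value is presumably ['.pdf'])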

['.md', '.mdx']
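
Extension-based routing of the kind attributed to LoaderRegistry can be sketched as follows. This is an illustration under stated assumptions, not the registry's actual code: `ExtensionLoader`, `extensionOf`, and `findLoader` are hypothetical names, and the case-insensitive match is an assumption.

```typescript
// Hypothetical minimal shape: only the supportedExtensions property is
// taken from the documentation above.
interface ExtensionLoader {
  readonly supportedExtensions: string[];
}

/** Returns the lower-cased extension with its leading dot, or "" if there is none. */
function extensionOf(filePath: string): string {
  const lastSeparator = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const base = filePath.slice(lastSeparator + 1);
  const dot = base.lastIndexOf(".");
  // dot === 0 means a dotfile such as ".gitignore", which has no extension.
  return dot <= 0 ? "" : base.slice(dot).toLowerCase();
}

/** Returns the first loader whose supportedExtensions contains the file's extension. */
function findLoader<T extends ExtensionLoader>(loaders: T[], filePath: string): T | undefined {
  const ext = extensionOf(filePath);
  return loaders.find((loader) => loader.supportedExtensions.includes(ext));
}
```

Storing extensions with the leading dot, as the property description requires, lets the lookup compare them directly against the suffix of the path.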

Implementation of

IDocumentLoader.supportedExtensions

Methods

canLoad()

canLoad(source): boolean

Defined in: packages/agentos/src/memory/ingestion/PdfLoader.ts:129

Returns true when this loader is capable of handling source.

For string sources the check is purely extension-based. For Buffer sources the loader may inspect magic bytes when relevant.
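
A check along these lines might look like the sketch below. It is not the PdfLoader implementation: `looksLikePdf` is a hypothetical name, `Uint8Array` stands in for Node's `Buffer` (which is a subclass of it) to keep the example self-contained, and the strict match at offset 0 is an assumption. The `%PDF-` header marker itself is the standard start of a PDF file.

```typescript
// Bytes of the ASCII string "%PDF-", the standard PDF header marker.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

/** Hypothetical check: extension for string paths, magic bytes for raw bytes. */
function looksLikePdf(source: string | Uint8Array): boolean {
  if (typeof source === "string") {
    // Purely extension-based; the file is not opened.
    return source.toLowerCase().endsWith(".pdf");
  }
  // Raw bytes: require the header marker at the very start (assumption).
  return (
    source.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((byte, i) => source[i] === byte)
  );
}
```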

Parameters

source

Absolute file path or raw bytes.

string | Buffer

Returns

boolean

Implementation of

IDocumentLoader.canLoad


load()

load(source, options?): Promise<LoadedDocument>

Defined in: packages/agentos/src/memory/ingestion/PdfLoader.ts:143

Parses source and returns a normalised LoadedDocument.

When source is a string the loader treats it as an absolute (or resolvable) file path and reads the file from disk. When source is a Buffer the loader parses the bytes directly and derives as much metadata as possible from the buffer content alone.
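
The two source kinds can be normalised to bytes before parsing, roughly as sketched here. The helper `toBytes` and its injected `readFile` parameter are hypothetical and exist only to keep the example self-contained; `Uint8Array` again stands in for `Buffer`. How PdfLoader actually reads files is not stated in this reference.

```typescript
type Source = string | Uint8Array;

interface ResolvedSource {
  bytes: Uint8Array;
  /** Present only when the source was a file path. */
  path?: string;
}

/**
 * Hypothetical helper: reads the file when given a path, passes bytes through
 * otherwise. The reader is injected so the sketch does not depend on a
 * specific file-system API.
 */
async function toBytes(
  source: Source,
  readFile: (path: string) => Promise<Uint8Array>,
): Promise<ResolvedSource> {
  if (typeof source === "string") {
    return { bytes: await readFile(source), path: source };
  }
  // No path is available, so path-derived metadata has to be omitted.
  return { bytes: source };
}
```

In a Node environment, `readFile` from `node:fs/promises` could be passed as the reader, since the `Buffer` it resolves to is a `Uint8Array`.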

Parameters

source

Absolute (or resolvable) file path, or raw document bytes.

string | Buffer

options?

Optional hints such as a format override.

LoadOptions

Returns

Promise<LoadedDocument>

A promise resolving to the fully-populated LoadedDocument.

Throws

When the file cannot be read or the format is not parsable.
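
Because load() rejects on unreadable or unparsable input, callers may want to wrap it. The sketch below shows one way to do so; `MinimalLoader`, `LoadResult`, and `tryLoad` are hypothetical names, and the type of the thrown value is not specified in this reference, so the code treats it as unknown.

```typescript
// Hypothetical minimal shape of a loader; only load() is modelled here.
interface MinimalLoader<TDoc> {
  load(source: string | Uint8Array): Promise<TDoc>;
}

type LoadResult<TDoc> =
  | { ok: true; doc: TDoc }
  | { ok: false; reason: string };

/** Converts a rejected load() into a value the caller can branch on. */
async function tryLoad<TDoc>(
  loader: MinimalLoader<TDoc>,
  source: string | Uint8Array,
): Promise<LoadResult<TDoc>> {
  try {
    return { ok: true, doc: await loader.load(source) };
  } catch (err) {
    // The thrown value's type is not documented, so handle non-Error values too.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, reason };
  }
}
```

A batch ingestion job could use the `ok` flag to skip a corrupt file and continue with the rest, rather than aborting on the first failure.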

Implementation of

IDocumentLoader.load