
ML Content Classifiers

Streaming ML content safety classification using BERT-family models with sliding-window, chunk-based evaluation. Detects toxicity, prompt injection, and jailbreak attempts in real time during LLM streaming.

Package: @framers/agentos-ext-ml-classifiers


Overview

The ML Content Classifiers extension provides two modes of operation:

  • Passive protection via a built-in guardrail that automatically classifies input and output content using three BERT-family models running in parallel
  • Active capability via an agent-callable tool (classify_content) for on-demand content safety classification

It detects:

  • Toxicity -- toxic, severe toxic, obscene, threat, insult, identity hate (via unitary/toxic-bert)
  • Prompt injection -- paraphrased, obfuscated, and indirect injection attacks (via protectai/deberta-v3-small-prompt-injection-v2)
  • Jailbreak -- role-play attacks, system prompt extraction, constraint bypasses (via meta-llama/PromptGuard-86M)

All models run via @huggingface/transformers with ONNX Runtime. Models are INT8 quantized by default for a ~50% smaller footprint with minimal accuracy loss, and are lazy-loaded on first use via ISharedServiceRegistry.


Installation

npm install @framers/agentos-ext-ml-classifiers

The extension requires @huggingface/transformers (already an AgentOS dependency):

npm install @huggingface/transformers

Usage

Direct factory usage

import { AgentOS } from '@framers/agentos';
import { createMLClassifierPack } from '@framers/agentos-ext-ml-classifiers';

const mlPack = createMLClassifierPack({
  classifiers: {
    toxicity: true,
    injection: true,
    jailbreak: true,
  },
  streamingMode: 'hybrid',
  chunkSize: 200,
  contextSize: 50,
});

const agent = new AgentOS();
await agent.initialize({
  ...config,
  manifest: { packs: [{ factory: () => mlPack }] },
});

Manifest-based loading

await agent.initialize({
  manifest: {
    packs: [
      {
        package: '@framers/agentos-ext-ml-classifiers',
        options: {
          classifiers: { toxicity: true, injection: true, jailbreak: false },
          streamingMode: 'non-blocking',
        },
      },
    ],
  },
});

Via curated registry

import { createCuratedManifest } from '@framers/agentos-extensions-registry';

const manifest = await createCuratedManifest({
  tools: ['ml-classifiers'],
  channels: 'none',
});

Default Classifiers

Toxicity Classifier

| Property | Value |
| --- | --- |
| Model | unitary/toxic-bert (66M params, INT8 ~33MB) |
| Labels | toxic, severe_toxic, obscene, threat, insult, identity_hate |
| Latency | ~20ms CPU ONNX, ~5ms GPU |
| AUC | 98.28 mean across 6 categories |
| Output | Multi-label (each label scored independently 0.0--1.0) |

Default thresholds: block > 0.9, flag > 0.7, warn > 0.4
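Because the toxicity model is multi-label, each label is scored independently and the most severe triggered action wins. A small sketch, using a hypothetical score vector (the values here are illustrative, not real model output):

```typescript
// Hypothetical multi-label output from the toxicity classifier.
const scores: Record<string, number> = {
  toxic: 0.92, severe_toxic: 0.31, obscene: 0.77,
  threat: 0.02, insult: 0.65, identity_hate: 0.04,
};

// Pick the highest-scoring label; with the defaults (block > 0.9,
// flag > 0.7, warn > 0.4), 'toxic' at 0.92 exceeds the block threshold.
const worst = Object.entries(scores).reduce((a, b) => (b[1] > a[1] ? b : a));
```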

Injection Classifier

| Property | Value |
| --- | --- |
| Model | protectai/deberta-v3-small-prompt-injection-v2 (44M params, INT8 ~22MB) |
| Labels | INJECTION, SAFE (binary classification) |
| Latency | ~50ms CPU ONNX, ~15ms GPU |
| Focus | Paraphrased, obfuscated, and indirect injections via tool outputs or RAG context |

Default thresholds: block INJECTION > 0.85, flag INJECTION > 0.5

Jailbreak Classifier

| Property | Value |
| --- | --- |
| Model | meta-llama/PromptGuard-86M (86M params, INT8 ~43MB) |
| Labels | jailbreak, injection, benign (multi-class, one wins) |
| Latency | ~60ms CPU ONNX, ~15ms GPU |
| Origin | Meta's LlamaFirewall |

Default thresholds: block jailbreak > 0.8, flag jailbreak > 0.5 OR injection > 0.5


IContentClassifier Interface

Add custom classifiers by implementing the IContentClassifier interface:

interface IContentClassifier {
  /** Unique identifier (e.g., 'my-custom-classifier') */
  readonly id: string;
  /** Human-readable display name */
  readonly displayName: string;
  /** What this classifier detects */
  readonly description: string;
  /** HuggingFace model ID or local path */
  readonly modelId: string;
  /** Whether the model is loaded and ready */
  readonly isLoaded: boolean;

  /** Classify text and return a ClassificationResult */
  classify(text: string): Promise<ClassificationResult>;

  /** Release model resources */
  dispose?(): Promise<void>;
}
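A minimal implementation might look like the following. Note that the exact shape of `ClassificationResult` is not shown in this document, so the fields used here are assumptions for illustration, and the keyword-based classifier is a toy stand-in for a real model:

```typescript
// Assumed shape of ClassificationResult -- the real type lives in the
// extension package; these fields are illustrative guesses.
interface ClassificationResult {
  classifierId: string;
  topLabel: string;
  topScore: number;
  scores: Record<string, number>;
}

interface IContentClassifier {
  readonly id: string;
  readonly displayName: string;
  readonly description: string;
  readonly modelId: string;
  readonly isLoaded: boolean;
  classify(text: string): Promise<ClassificationResult>;
  dispose?(): Promise<void>;
}

// Toy keyword-based classifier: no model download, so isLoaded is always true.
class KeywordClassifier implements IContentClassifier {
  readonly id = 'keyword-blocklist';
  readonly displayName = 'Keyword Blocklist';
  readonly description = 'Flags text containing blocklisted keywords';
  readonly modelId = 'local/keyword-blocklist';
  readonly isLoaded = true;

  constructor(private keywords: string[]) {}

  async classify(text: string): Promise<ClassificationResult> {
    const lower = text.toLowerCase();
    const hit = this.keywords.some((k) => lower.includes(k));
    const score = hit ? 1.0 : 0.0;
    return {
      classifierId: this.id,
      topLabel: hit ? 'blocked_keyword' : 'safe',
      topScore: hit ? score : 1.0,
      scores: { blocked_keyword: score, safe: 1 - score },
    };
  }
}
```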

Register custom classifiers in the pack options:

const pack = createMLClassifierPack({
  customClassifiers: [new MyCustomClassifier()],
  classifiers: { toxicity: true }, // defaults still run alongside
});

SlidingWindowBuffer

The sliding window buffer manages token accumulation and context carry-forward for streaming classification. It decides when a chunk is ready for classification, decoupled from the classification logic itself.

How It Works

  1. TEXT_DELTA chunks feed into the buffer via push(streamId, text)
  2. Tokens accumulate until chunkSize (default 200) tokens are reached
  3. When ready, the buffer returns the chunk text with contextSize (default 50) tokens carried forward from the previous chunk's tail
  4. On stream end, flush() returns any remaining buffered text
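The accumulation and carry-forward described above can be sketched as follows. This is an illustrative reimplementation under the documented parameters, not the package's actual `SlidingWindowBuffer`:

```typescript
// Minimal sketch of the sliding-window chunking logic: accumulate deltas,
// emit a chunk once chunkSize tokens are buffered, and carry contextSize
// tokens of the tail forward as overlap for the next chunk.
class SlidingWindowSketch {
  private buffers = new Map<string, string>();
  private carry = new Map<string, string>();

  constructor(
    private chunkSize = 200,   // tokens to accumulate before a chunk is ready
    private contextSize = 50,  // tail tokens carried into the next chunk
  ) {}

  // ~4 characters per token, the heuristic the buffer uses for English text
  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  /** Accumulate a TEXT_DELTA; returns chunk text (with carried context) when ready. */
  push(streamId: string, delta: string): string | null {
    const buf = (this.buffers.get(streamId) ?? '') + delta;
    if (this.estimateTokens(buf) < this.chunkSize) {
      this.buffers.set(streamId, buf);
      return null;
    }
    const chunk = (this.carry.get(streamId) ?? '') + buf;
    // Keep ~contextSize tokens' worth of tail characters as overlap context,
    // so violations spanning a chunk boundary are still visible next time.
    this.carry.set(streamId, buf.slice(-this.contextSize * 4));
    this.buffers.set(streamId, '');
    return chunk;
  }

  /** On stream end, return whatever is still buffered. */
  flush(streamId: string): string | null {
    const rest = this.buffers.get(streamId) ?? '';
    this.buffers.delete(streamId);
    this.carry.delete(streamId);
    return rest.length > 0 ? rest : null;
  }
}
```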

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| chunkSize | 200 | Tokens to accumulate before triggering classification. Larger = better accuracy, slower detection. |
| contextSize | 50 | Tokens carried forward from previous chunk tail as overlap context. Prevents violations spanning chunk boundaries from being missed. |
| maxEvaluations | 100 | Cap on total classifier invocations per stream. |
| streamTimeoutMs | 30000 | Stale stream cleanup timeout. |

Token Estimation

Token count is estimated at ~4 characters per token (a common heuristic for English text). This is intentionally approximate -- the buffer decides when to classify, not how to tokenize for the model (the model's own tokenizer handles that).


Streaming Modes

The guardrail supports three streaming modes, all implemented within the IGuardrailService contract:

Non-blocking (default)

evaluateOutput() returns null immediately for accumulating chunks. Classification fires asynchronously in the background. On the next evaluateOutput() call, the guardrail checks the previous async result -- if it was a violation, it returns BLOCK at that point. Tokens stream with ~0ms added latency; violations are caught with a one-chunk delay (~2s at chunkSize=200).
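The deferred-violation pattern above can be sketched as follows. The types and class here are illustrative, not the guardrail's actual internals:

```typescript
// Sketch of non-blocking evaluation: kick off classification in the
// background and surface any violation on the *next* evaluateOutput() call.
type Verdict = { action: 'BLOCK'; reason: string } | null;

class NonBlockingEvaluator {
  private lastResult = new Map<string, Verdict>();

  async evaluateOutput(
    streamId: string,
    chunkReady: string | null, // chunk text if the buffer is ready, else null
    classify: (text: string) => Promise<Verdict>,
  ): Promise<Verdict> {
    // 1. If a previous background classification found a violation, block now.
    const prior = this.lastResult.get(streamId);
    if (prior) return prior;

    // 2. Fire classification in the background; record the verdict for
    //    the next call rather than awaiting it here.
    if (chunkReady !== null) {
      void classify(chunkReady).then((v) => {
        if (v) this.lastResult.set(streamId, v);
      });
    }

    // 3. Return immediately -- tokens keep streaming with ~0ms added latency.
    return null;
  }
}
```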

Blocking

evaluateOutput() awaits classification before returning. When the buffer has not reached chunkSize, returns null immediately. When the buffer is ready, the call blocks for ~20--60ms while classifiers run. Users see smooth streaming with imperceptible ~60ms micro-pauses every ~2 seconds.

Hybrid

First chunk uses blocking mode (catches injection in the first response -- the most dangerous attack vector). Subsequent chunks use non-blocking for smooth streaming with one-chunk-delayed violation detection.

const pack = createMLClassifierPack({
  streamingMode: 'hybrid', // first chunk blocking, rest non-blocking
});

Configuration

MLClassifierPackOptions

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| classifiers | { toxicity?, injection?, jailbreak? } | all true | Toggle each classifier independently. Pass true for defaults or a ClassifierConfig for overrides. |
| customClassifiers | IContentClassifier[] | [] | Additional classifiers to run alongside defaults. |
| modelCacheDir | string | ~/.wunderland/models/ | Model cache directory (Node.js only). |
| quantized | boolean | true | Use INT8 quantized models for lower memory. |
| runtime | 'node' \| 'browser' \| 'edge' \| 'auto' | 'auto' | Runtime environment hint. Auto-detected if omitted. |
| browser | BrowserConfig | -- | Browser-specific configuration (Web Worker, cache strategy). |
| chunkSize | number | 200 | Tokens per sliding window chunk. |
| contextSize | number | 50 | Context overlap tokens carried forward. |
| maxEvaluations | number | 100 | Max evaluations per stream. |
| streamingMode | 'non-blocking' \| 'blocking' \| 'hybrid' | 'non-blocking' | Streaming evaluation strategy. |
| thresholds | Partial<ClassifierThresholds> | -- | Default action thresholds for all classifiers. |
| guardrailScope | 'input' \| 'output' \| 'both' | 'both' | Which direction(s) the guardrail applies to. |

ClassifierThresholds

| Threshold | Default | Description |
| --- | --- | --- |
| blockThreshold | 0.9 | Score above which the stream is BLOCKED immediately. |
| flagThreshold | 0.7 | Score above which the result is FLAGGED for escalation. |
| warnThreshold | 0.4 | Score above which a warning is logged (no action). |
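The score-to-action mapping works out to a simple cascade. A sketch using the documented defaults (the function name and action type are illustrative):

```typescript
// Map a classifier score to a threshold action, most severe first.
type ThresholdAction = 'block' | 'flag' | 'warn' | 'allow';

interface ClassifierThresholds {
  blockThreshold: number;
  flagThreshold: number;
  warnThreshold: number;
}

const DEFAULTS: ClassifierThresholds = {
  blockThreshold: 0.9,
  flagThreshold: 0.7,
  warnThreshold: 0.4,
};

function actionForScore(
  score: number,
  t: ClassifierThresholds = DEFAULTS,
): ThresholdAction {
  if (score > t.blockThreshold) return 'block'; // stream BLOCKED immediately
  if (score > t.flagThreshold) return 'flag';   // flagged for escalation
  if (score > t.warnThreshold) return 'warn';   // warning logged, no action
  return 'allow';
}
```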

Per-Classifier Overrides

const pack = createMLClassifierPack({
  classifiers: {
    toxicity: {
      modelId: 'custom/my-toxicity-model', // override model
      thresholds: { blockThreshold: 0.95 }, // override thresholds
      labelActions: { identity_hate: 'block' }, // always block this label
    },
    injection: true, // use defaults
    jailbreak: false, // disable entirely
  },
});

Browser Support

The extension runs in browser environments using ONNX Runtime WASM:

Web Worker

By default, classification is offloaded to a Web Worker to avoid blocking the UI thread for 50--100ms per chunk. The worker is created lazily on first classification call and falls back to main-thread execution if Worker creation fails (e.g., CSP restrictions).

Cache API

Models are cached in the browser using the Cache API (default) or IndexedDB for persistence across page loads. LRU eviction kicks in when maxCacheSize (default 200MB) is exceeded.

Configuration

const pack = createMLClassifierPack({
  runtime: 'browser',
  browser: {
    useWebWorker: true,
    cacheStrategy: 'cache-api',
    maxCacheSize: 200 * 1024 * 1024,
    onProgress: ({ modelId, percent }) => {
      console.log(`Downloading ${modelId}: ${percent}%`);
    },
  },
});

Agent Tools

classify_content

On-demand content safety classification. Lets agents proactively classify arbitrary text before forwarding it to external APIs, including it in responses, or presenting it to users.

Agent: I'll check this user comment for safety before posting.
-> classify_content({ text: "user-submitted comment", classifiers: ["toxicity"] })
<- {
     results: [{ classifierId: "toxicity", topLabel: "toxic", topScore: 0.02 }],
     recommendedAction: "allow",
     triggeredBy: null,
     totalLatencyMs: 22
   }

Memory Impact

| Component | Memory | When Loaded |
| --- | --- | --- |
| Toxicity model (toxic-bert INT8) | ~33MB | First classification call |
| Injection model (DeBERTa INT8) | ~22MB | First classification call |
| Jailbreak model (PromptGuard INT8) | ~43MB | First classification call |
| SlidingWindowBuffer state | ~1KB per stream | First TEXT_DELTA |
| Total (all 3 models, 10 streams) | ~98MB | -- |

All models are lazy-loaded. If only toxicity is enabled, memory cost is ~33MB. Models are shared across extensions via ISharedServiceRegistry -- if another extension uses the same model, zero additional memory.


Graceful Degradation

| Condition | Behavior |
| --- | --- |
| @huggingface/transformers not installed | Pack logs error, all messages pass (fail-open) |
| Model download fails | That classifier marked unavailable, contributes ALLOW to aggregation |
| ONNX Runtime not available | Falls back to WASM backend (browser/edge) |
| Single classifier throws | Warning logged, other classifiers continue, failed one contributes ALLOW |
| Max evaluations exceeded | Remaining chunks pass without classification |
| Stream timeout | Buffer state cleaned up, no memory leak |
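The fail-open behavior for a throwing classifier can be sketched as a simple wrapper. The names here are illustrative, not the extension's internals:

```typescript
// A classifier that throws contributes ALLOW to aggregation, so the
// remaining classifiers' verdicts still apply (fail-open).
type Outcome = { classifierId: string; action: 'ALLOW' | 'BLOCK' };

async function classifySafely(
  classifierId: string,
  classify: () => Promise<'ALLOW' | 'BLOCK'>,
  log: (msg: string) => void = console.warn,
): Promise<Outcome> {
  try {
    return { classifierId, action: await classify() };
  } catch (err) {
    // Warning logged; the failed classifier contributes ALLOW.
    log(`classifier ${classifierId} failed: ${String(err)}`);
    return { classifierId, action: 'ALLOW' };
  }
}
```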