Memory Benchmarks

AgentOS posts +1.4 points above Mastra at the matched gpt-4o reader on LongMemEval-S (85.6% vs Mastra's 84.23%), and is the only open-source library on the public record above 65% on the harder M variant.

This page is the canonical comparison table. Every cell links to its primary source. Cross-provider configurations (e.g. Mastra's gpt-5-mini reader + gemini-2.5-flash observer) are excluded because their results cannot be reproduced from public methodology disclosures.

TL;DR

  • LongMemEval-S at full N=500, gpt-4o reader: AgentOS at 85.6% is +1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader. $0.0090 per correct, 3.6-second median latency.
  • LongMemEval-M at full N=500, gpt-4o reader: AgentOS at 70.2% is competitive with the strongest published M results in the LongMemEval paper (Wu et al. ICLR 2025, Table 3). The paper's three primary GPT-4o configurations: round Top-5 65.7% (we're +4.5), session Top-5 71.4% (we're 1.2 below), round Top-10 72.0% (we're 1.8 below at the harder Top-5 retrieval budget). First open-source library above 65% on M with publicly reproducible methodology. Closest published external number is AgentBrain's 71.7% from their closed-source SaaS.
  • All 15 adjacent stress-tested configurations regress against the 85.6% headline, making it locally Pareto-optimal in the tested parameter space.

LongMemEval-S Phase B (115K tokens, 50 sessions per haystack)

Same dataset (data/longmemeval/longmemeval_s.json), full N=500, same gpt-4o-2024-08-06 judge, same gpt-4o reader across every row.

| System | Accuracy | $/correct | p50 latency | Source |
|---|---|---|---|---|
| EmergenceMem Internal | 86.0% | not published | 5,650 ms | emergence.ai |
| 🚀 AgentOS canonical-hybrid + reader-router | 85.6% | $0.0090 | 3,558 ms | 85.6% Pareto-win post |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | mastra.ai |
| Supermemory gpt-4o | 81.6% | not published | not published | supermemory.ai |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | vendor reproduction adapter |
| Zep self-reported | 71.2% | not published | 632 ms p95 search | getzep.com |
| Zep independently reproduced | 63.8% | not published | not published | arXiv:2512.13564 |

At the matched gpt-4o reader, AgentOS at 85.6% is +1.4 points above Mastra OM gpt-4o (84.23%) and is the highest published number from an open-source library that ships an end-to-end agent runtime around its memory system. EmergenceMem Internal posts 86.0% (0.4 points above us), but at a published median latency of 5,650 ms versus AgentOS's 3,558 ms p50.

Cost at scale: $0.0090 per memory-grounded correct answer works out to $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.

Why other Mastra and managed-platform numbers are not in this table

  • Mastra OM 94.9% uses gpt-5-mini reader + gemini-2.5-flash observer (cross-provider). Their public methodology page does not include enough detail to reproduce the result; we cannot independently verify it.
  • Mem0 v3 93.4% is a managed-platform number with no published CI, no judge model disclosure, no reader model disclosure. Their own State of AI Agent Memory 2026 post reports 66.9% on LOCOMO for their production stack, suggesting the 93.4% reflects the managed-evaluation harness more than the architecture.
  • Hindsight 91.4% uses gemini-3-pro reader (cross-provider).
  • Supermemory 85.2% uses gemini-3-pro reader (cross-provider).
  • agentmemory 96.2% has no published CI, no methodology breakdown.

LongMemEval-M Phase B (1.5M tokens, 500 sessions per haystack)

The harder variant. M's haystacks exceed every production context window: GPT-4o is 128K, Claude Opus is 200K, Gemini 3 Pro is 1M. Most memory vendors stop at S because raw long-context fits there.

| System | Accuracy | License | Source |
|---|---|---|---|
| AgentBrain | 71.7% (Test 0) | closed-source SaaS, requires hosted endpoint | github.com/AgentBrainHQ |
| 🚀 AgentOS (sem-embed + reader-router + top-K=5) | 70.2% | Apache-2.0 | 70.2% post |
| LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo | Wu et al. ICLR 2025, Table 3 |
| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo | Wu et al. ICLR 2025, Table 3 |
| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo | Wu et al. ICLR 2025, Table 3 |
| Mem0 v3 | not published | Apache 2.0 | reports S only |
| Mastra OM | not published | Apache 2.0 | reports S only |
| Hindsight | not published | open repo | reports S only |
| Zep | not published | Apache 2.0 | "due to gpt-4o's 128K context window we chose S over M" |
| EmergenceMem | not published | open Python | reports S only |
| Supermemory | not published | open | reports S only |
| MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta, Cognee | not published | various | reports S only or no LongMemEval |

Competitive with the strongest published M results in the LongMemEval paper. At matched reader-Top-5 retrieval, AgentOS is +4.5 above the round-level configuration (65.7%) and 1.2 below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. AgentOS is the first open-source library above 65% on M with publicly reproducible methodology (per-case run JSONs at fixed seed, single-CLI reproduction). The closest published external number is AgentBrain's 71.7% from their closed-source SaaS.

The journey: 30.6% → 45.4% → 57.6% → 70.2%

| Date | Configuration | Aggregate | Lift |
|---|---|---|---|
| 2026-04-25 | Tier 1 canonical (CharHash, top-K=20) | 30.6% | baseline |
| 2026-04-26 | M-tuned (HyDE + top-K=50 + rerank-mult=5, CharHash) | 45.4% | +14.8 pp |
| 2026-04-29 | M-tuned + sem-embed + reader-router (top-K=50) | 57.6% | +12.2 pp |
| 2026-04-29 | M-tuned + sem-embed + reader-router + top-K=5 | 70.2% | +12.6 pp |

Cumulative: +39.6 pp over the original baseline. Each step has CIs disjoint from the prior step.

LOCOMO (out-of-distribution transfer)

LongMemEval-tuned pipeline, no LOCOMO-specific tuning, gpt-4o reader, N=1986:

| Configuration | Accuracy | $/correct | Note |
|---|---|---|---|
| AgentOS K=20 retrieval (Pareto-best LOCOMO tuning) | 51.5% | $0.0099 | Stage F-3 |
| AgentOS Tier 1 OOD baseline | 49.9% | $0.0123 | no tuning |
| Mem0 self-reported (managed) | 66–68% | not published | LOCOMO with default gpt-4o-mini judge (Penfield FPR 62.81%) |

Judge FPR comparison (the variable that swings LOCOMO scores 30-60 pp):

| Benchmark | AgentOS judge FPR | LOCOMO default judge FPR |
|---|---|---|
| LongMemEval-S | 1% | not published |
| LongMemEval-M | 2% | not published |
| LOCOMO | 0% | 62.81% (Penfield Labs) |

The 62.81% FPR ceiling on LOCOMO's default gpt-4o-mini judge means any LOCOMO score above ~93.6% benefits from benchmark errors, and any score difference below ~6 pp sits inside judge noise. AgentOS uses gpt-4o-2024-08-06 with rubric 2026-04-18.1, which measures 0% FPR on LOCOMO under the adversarial probe.

Methodology disclosure (12 axes most vendors omit)

| Axis | AgentOS | Mem0 | Mastra | Supermemory | Zep | Emergence | Letta | MemPalace |
|---|---|---|---|---|---|---|---|---|
| Aggregate accuracy | yes | yes | yes | yes | yes | yes | partial | yes |
| 95% confidence interval | yes | no | no | no | partial | no | no | no |
| Per-category 95% interval | yes | no | no | no | no | no | no | no |
| Reader model disclosed | yes | no | yes | partial | yes | yes | no | no |
| Observer / ingest model disclosed | yes | no | yes | no | yes | yes | no | no |
| USD cost per correct | yes | no | no | no | no | no | no | no |
| Latency avg / p50 / p95 | yes | no | no | no | partial | median only | no | no |
| Per-category breakdown | yes | no | yes | yes | yes | yes | partial | no |
| Open-source benchmark runner | yes | yes | partial | yes | partial | yes | no | partial |
| Per-case run JSONs at fixed seed | yes | no | no | no | no | no | no | no |
| Judge-adversarial FPR probe | yes | no | no | no | no | no | no | no |
| Matched-reader cross-vendor table | yes | no | no | partial | partial | yes | no | no |

The full audit framework is at Memory Benchmark Transparency Audit. Per-case run JSONs at seed=42 are committed under packages/agentos-bench/results/runs/ for every published number.

Reproducing

The 85.6% LongMemEval-S headline:

```bash
git clone https://github.com/framersai/agentos-bench
cd agentos-bench
pnpm install && pnpm build

# Set OPENAI_API_KEY and COHERE_API_KEY in your environment
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-s \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000
```

The 70.2% LongMemEval-M headline (the single-variable change from the 57.6% configuration is --reader-top-k 5):

```bash
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-m \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere --rerank-candidate-multiplier 5 \
  --reader-top-k 5 \
  --hyde \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000
```

Both runs ship with per-case run JSONs at seed=42. The full bench leaderboard is at packages/agentos-bench/results/LEADERBOARD.md.

References