Memory Benchmarks

AgentOS scores 85.6% on LongMemEval-S with a gpt-4o reader, 1.4 points above Mastra's 84.23% at the matched reader, and is the only open-source library on the public record above 65% on the harder M variant.

This page is the canonical comparison table. Every cell links to its primary source. Cross-provider configurations (e.g. Mastra's gpt-5-mini reader + gemini-2.5-flash observer) are excluded because their results cannot be reproduced from public methodology disclosures.

TL;DR

LongMemEval-S at full N=500, gpt-4o reader: AgentOS at 85.6% is +1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader, at $0.0090 per correct answer and 3.6-second median latency.
LongMemEval-M at full N=500, gpt-4o reader: AgentOS at 70.2% is competitive with the strongest published M results in the LongMemEval paper (Wu et al. ICLR 2025, Table 3). The paper's three primary GPT-4o configurations: round Top-5 65.7% (we're +4.5), session Top-5 71.4% (we're 1.2 below), round Top-10 72.0% (we're 1.8 below at the harder Top-5 retrieval budget). First open-source library above 65% on M with publicly reproducible methodology. Closest published external number is AgentBrain's 71.7% from their closed-source SaaS.
All 15 adjacent stress-tested configurations regress against the 85.6% headline, so the headline configuration is locally Pareto-optimal in the tested parameter space.
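One way to read "Pareto-optimal" here: a configuration is on the Pareto front when no other tested configuration is at least as good on accuracy, cost, and latency while being strictly better on at least one of them. The sketch below applies that standard definition to the two LongMemEval-S rows on this page that publish all three metrics. The `BenchRow` type and the function names are illustrative assumptions, not part of agentos-bench, and the 15 stress-tested configurations are not reproduced here.

```typescript
// One benchmarked configuration. Higher accuracy is better;
// lower cost and lower latency are better.
interface BenchRow {
  name: string;
  accuracyPct: number;
  usdPerCorrect: number;
  p50LatencyMs: number;
}

// True when `a` is at least as good as `b` on every axis and strictly
// better on at least one.
function dominates(a: BenchRow, b: BenchRow): boolean {
  const noWorse =
    a.accuracyPct >= b.accuracyPct &&
    a.usdPerCorrect <= b.usdPerCorrect &&
    a.p50LatencyMs <= b.p50LatencyMs;
  const strictlyBetter =
    a.accuracyPct > b.accuracyPct ||
    a.usdPerCorrect < b.usdPerCorrect ||
    a.p50LatencyMs < b.p50LatencyMs;
  return noWorse && strictlyBetter;
}

// Rows that no other row dominates.
function paretoFront(rows: BenchRow[]): BenchRow[] {
  return rows.filter(
    (row) => !rows.some((other) => other !== row && dominates(other, row)),
  );
}

// Values copied from the LongMemEval-S table on this page.
const rows: BenchRow[] = [
  {
    name: "AgentOS canonical-hybrid + reader-router",
    accuracyPct: 85.6,
    usdPerCorrect: 0.009,
    p50LatencyMs: 3558,
  },
  {
    name: "EmergenceMem Simple Fast (rerun)",
    accuracyPct: 80.6,
    usdPerCorrect: 0.0586,
    p50LatencyMs: 3703,
  },
];

console.log(paretoFront(rows).map((row) => row.name));
```

Rows with unpublished cost or latency cannot be placed on the front either way, which is why only two rows appear in the sketch.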

LongMemEval-S Phase B (115K tokens, 50 sessions per haystack)

Same dataset (data/longmemeval/longmemeval_s.json), full N=500, same gpt-4o-2024-08-06 judge, same gpt-4o reader across every row.

System	Accuracy	$/correct	p50 latency	Source
EmergenceMem Internal	86.0%	not published	5,650 ms	emergence.ai
🚀 AgentOS canonical-hybrid + reader-router	85.6%	$0.0090	3,558 ms	85.6% Pareto-win post
Mastra OM gpt-4o (gemini-flash observer)	84.23%	not published	not published	mastra.ai
Supermemory gpt-4o	81.6%	not published	not published	supermemory.ai
EmergenceMem Simple Fast (rerun in agentos-bench)	80.6%	$0.0586	3,703 ms	vendor reproduction adapter
Zep self-reported	71.2%	not published	632 ms p95 search	getzep.com
Zep independently reproduced	63.8%	not published	not published	arXiv:2512.13564

AgentOS is +1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader. At 85.6%, AgentOS holds the highest published number from an open-source library that ships an end-to-end agent runtime around its memory system. EmergenceMem Internal posts 86.0% (0.4 above us), but AgentOS's p50 latency is 3,558 ms against EmergenceMem's published median of 5,650 ms.
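The latency column mixes medians (p50) with one p95 figure, so the entries are not all directly comparable. As a reminder of what those statistics mean, here is a minimal nearest-rank percentile sketch. The sample latencies are made up for illustration, and agentos-bench may use a different interpolation rule.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical per-case latencies in milliseconds (not real benchmark data).
const latenciesMs = [3900, 3200, 5100, 3558, 3400];

console.log(percentile(latenciesMs, 50)); // 3558
console.log(percentile(latenciesMs, 95)); // 5100
```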

Cost at scale: $0.0090 per memory-grounded answer = $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
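The arithmetic above can be checked with a few lines. This sketch follows the paragraph in treating $0.0090 as the cost of one RAG call and assumes spend scales linearly with call count. If the $/correct figure is instead total spend divided by the number of correct answers (an assumption; this page does not define it), the per-call cost would be $/correct times accuracy, which would make the figures above an upper bound. All function names are illustrative.

```typescript
// USD per memory-grounded answer, from the LongMemEval-S headline row.
const usdPerAnswer = 0.009;

// Projected spend, assuming cost scales linearly with the number of calls.
function projectedCostUsd(
  callsPerConversation: number,
  conversations: number,
  usdPerCall: number,
): number {
  return callsPerConversation * conversations * usdPerCall;
}

// Per-call cost under the ASSUMPTION that $/correct = total spend / correct answers.
function perCallFromPerCorrect(usdPerCorrect: number, accuracy: number): number {
  return usdPerCorrect * accuracy;
}

console.log(projectedCostUsd(1, 1000, usdPerAnswer).toFixed(2)); // "9.00"
console.log(projectedCostUsd(5, 1000, usdPerAnswer).toFixed(2)); // "45.00"
console.log(perCallFromPerCorrect(usdPerAnswer, 0.856).toFixed(4)); // "0.0077"
```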

Why other Mastra and managed-platform numbers are not in this table

Mastra OM 94.9% uses gpt-5-mini reader + gemini-2.5-flash observer (cross-provider). Their public methodology page does not include enough detail to reproduce the result; we cannot independently verify it.
Mem0 v3 93.4% is a managed-platform number with no published CI, no judge model disclosure, no reader model disclosure. Their own State of AI Agent Memory 2026 post reports 66.9% on LOCOMO for their production stack, suggesting the 93.4% reflects the managed-evaluation harness more than the architecture.
Hindsight 91.4% uses gemini-3-pro reader (cross-provider).
Supermemory 85.2% uses gemini-3-pro reader (cross-provider).
agentmemory 96.2% has no published CI, no methodology breakdown.

LongMemEval-M Phase B (1.5M tokens, 500 sessions per haystack)

The harder variant. M's haystacks exceed every production context window: GPT-4o is 128K, Claude Opus is 200K, Gemini 3 Pro is 1M. Most memory vendors stop at S because raw long-context fits there.
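The size comparison can be made concrete with the figures quoted on this page. The sketch reads "K" as 1,000 tokens and "M" as 1,000,000 (an assumption about rounding) and ignores prompt and output overhead, so it is a best case for raw long-context.

```typescript
// Context windows quoted on this page, in tokens.
const contextWindows: Record<string, number> = {
  "GPT-4o": 128000,
  "Claude Opus": 200000,
  "Gemini 3 Pro": 1000000,
};

// Haystack sizes quoted in the section headings, in tokens.
const haystackTokens: Record<string, number> = {
  "LongMemEval-S": 115000,
  "LongMemEval-M": 1500000,
};

// Models whose window could hold the entire haystack at once.
function modelsThatFit(tokens: number): string[] {
  return Object.keys(contextWindows).filter(
    (model) => contextWindows[model] >= tokens,
  );
}

for (const variant of Object.keys(haystackTokens)) {
  console.log(variant, modelsThatFit(haystackTokens[variant]));
}
```

Under these assumptions every listed model can hold an S haystack and none can hold an M haystack, which is the gap a retrieval-based memory system has to close.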

System	Accuracy	License	Source
AgentBrain	71.7% (Test 0)	closed-source SaaS, requires hosted endpoint	github.com/AgentBrainHQ
🚀 AgentOS (sem-embed + reader-router + top-K=5)	70.2%	Apache-2.0	70.2% post
LongMemEval paper, strongest GPT-4o (round, Top-10)	72.0%	open repo	Wu et al. ICLR 2025, Table 3
LongMemEval paper, GPT-4o session Top-5	71.4%	open repo	Wu et al. ICLR 2025, Table 3
LongMemEval paper, GPT-4o round Top-5	65.7%	open repo	Wu et al. ICLR 2025, Table 3
Mem0 v3	not published	Apache 2.0	reports S only
Mastra OM	not published	Apache 2.0	reports S only
Hindsight	not published	open repo	reports S only
Zep	not published	Apache 2.0	"due to gpt-4o's 128K context window we chose S over M"
EmergenceMem	not published	open Python	reports S only
Supermemory	not published	open	reports S only
MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta, Cognee	not published	various	reports S only or no LongMemEval

AgentOS is competitive with the strongest published M results in the LongMemEval paper. At matched reader-Top-5 retrieval, AgentOS is +4.5 above the round-level configuration (65.7%) and 1.2 below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. AgentOS is the first open-source library above 65% on M with publicly reproducible methodology (per-case run JSONs at fixed seed, single-CLI reproduction). The closest published external number is AgentBrain's 71.7% from their closed-source SaaS.

The journey: 30.6% → 45.4% → 57.6% → 70.2%

Date	Configuration	Aggregate	Lift
2026-04-25	Tier 1 canonical (CharHash, top-K=20)	30.6%	baseline
2026-04-26	M-tuned (HyDE + top-K=50 + rerank-mult=5, CharHash)	45.4%	+14.8 pp
2026-04-29	M-tuned + sem-embed + reader-router (top-K=50)	57.6%	+12.2 pp
2026-04-29	M-tuned + sem-embed + reader-router + top-K=5	70.2%	+12.6 pp

Cumulative: +39.6 pp over the original baseline. Each step has CIs disjoint from the prior step.
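The lift column and the cumulative figure follow directly from the aggregate column. A small sketch that recomputes them, with one-decimal rounding to avoid floating-point noise (the helper names are illustrative):

```typescript
// Aggregate accuracy (%) at each step of the journey table.
const aggregates = [30.6, 45.4, 57.6, 70.2];

// Round to one decimal place to hide noise such as 14.799999999999997.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// Lift of each step over the one before it, in percentage points.
function stepLifts(values: number[]): number[] {
  const lifts: number[] = [];
  for (let i = 1; i < values.length; i++) {
    lifts.push(round1(values[i] - values[i - 1]));
  }
  return lifts;
}

const lifts = stepLifts(aggregates);
const cumulative = round1(aggregates[aggregates.length - 1] - aggregates[0]);

console.log(lifts.join(", ")); // 14.8, 12.2, 12.6
console.log(cumulative); // 39.6
```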

LOCOMO (out-of-distribution transfer)

LongMemEval-tuned pipeline, gpt-4o reader, N=1986. The Tier 1 row applies no LOCOMO-specific tuning; the K=20 row is the Pareto-best retrieval setting found on LOCOMO:

Configuration	Accuracy	$/correct	Note
AgentOS K=20 retrieval (Pareto-best LOCOMO tuning)	51.5%	$0.0099	Stage F-3
AgentOS Tier 1 OOD baseline	49.9%	$0.0123	no tuning
Mem0 self-reported (managed)	66-68%	not published	LOCOMO with default `gpt-4o-mini` judge (Penfield FPR 62.81%)

Judge false-positive rate (FPR) comparison. Judge choice is the variable that swings LOCOMO scores by 30-60 pp:

Benchmark	AgentOS judge FPR	LOCOMO default judge FPR
LongMemEval-S	1%	not published
LongMemEval-M	2%	not published
LOCOMO	0%	62.81% (Penfield Labs)

The 62.81% FPR measured on LOCOMO's default gpt-4o-mini judge means that any LOCOMO score above ~93.6% benefits from benchmark errors, and any score difference below ~6 pp sits in judge noise. AgentOS uses gpt-4o-2024-08-06 with rubric 2026-04-18.1, which probes at 0% FPR on LOCOMO.
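A judge FPR probe can be sketched as follows: feed the judge answers that are known to be wrong and measure the fraction it accepts. This is a generic illustration, not the Penfield Labs or AgentOS probe, whose exact design is not described on this page. The `Judge` type, the two toy judges, and the probe cases are all hypothetical stand-ins for an LLM call and a real probe set.

```typescript
// A probe pairs a question and its gold answer with a deliberately wrong answer.
interface ProbeCase {
  question: string;
  gold: string;
  wrongAnswer: string;
}

// Returns true when the judge accepts the candidate as correct.
type Judge = (question: string, gold: string, candidate: string) => boolean;

// False-positive rate: share of known-wrong answers the judge accepts.
function falsePositiveRate(judge: Judge, probes: ProbeCase[]): number {
  if (probes.length === 0) throw new Error("need at least one probe case");
  const accepted = probes.filter((p) =>
    judge(p.question, p.gold, p.wrongAnswer),
  ).length;
  return accepted / probes.length;
}

// Toy judge 1: accepts only an exact (case-insensitive) match.
const strictJudge: Judge = (_question, gold, candidate) =>
  candidate.trim().toLowerCase() === gold.trim().toLowerCase();

// Toy judge 2: accepts any candidate sharing a word with the gold answer.
const lenientJudge: Judge = (_question, gold, candidate) => {
  const goldWords = new Set(gold.toLowerCase().split(/\s+/));
  return candidate
    .toLowerCase()
    .split(/\s+/)
    .some((word) => goldWords.has(word));
};

const probes: ProbeCase[] = [
  { question: "Where did Ana move?", gold: "Lisbon", wrongAnswer: "Porto" },
  { question: "What did Ben buy?", gold: "a red bicycle", wrongAnswer: "a blue bicycle" },
  { question: "When is the launch?", gold: "March 14", wrongAnswer: "March 15" },
  { question: "What pets does Cy have?", gold: "two cats", wrongAnswer: "three dogs" },
];

console.log(falsePositiveRate(strictJudge, probes)); // 0
console.log(falsePositiveRate(lenientJudge, probes)); // 0.5
```

A judge that rewards topical overlap rather than correctness inflates every system's score, which is why differences smaller than the judge's error rate carry little information.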

Methodology disclosure (12 axes most vendors omit)

Axis	AgentOS	Mem0	Mastra	Supermemory	Zep	Emergence	Letta	MemPalace
Aggregate accuracy	yes	yes	yes	yes	yes	yes	partial	yes
95% confidence interval	yes	no	no	no	partial	no	no	no
Per-category 95% interval	yes	no	no	no	no	no	no	no
Reader model disclosed	yes	no	yes	partial	yes	yes	no	no
Observer / ingest model disclosed	yes	no	yes	no	yes	yes	no	no
USD cost per correct	yes	no	no	no	no	no	no	no
Latency avg / p50 / p95	yes	no	no	no	partial	median only	no	no
Per-category breakdown	yes	no	yes	yes	yes	yes	partial	no
Open-source benchmark runner	yes	yes	partial	yes	partial	yes	no	partial
Per-case run JSONs at fixed seed	yes	no	no	no	no	no	no	no
Judge-adversarial FPR probe	yes	no	no	no	no	no	no	no
Matched-reader cross-vendor table	yes	no	no	partial	partial	yes	no	no

The full audit framework is at Memory Benchmark Transparency Audit. Per-case run JSONs at seed=42 are committed under packages/agentos-bench/results/runs/ for every published number.
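To illustrate the confidence-interval axis, here is a minimal percentile-bootstrap sketch over per-case correct/incorrect outcomes, seeded so the result is repeatable. It assumes a plain percentile bootstrap and uses the well-known mulberry32 generator; the page only states that runs use a fixed seed and a configurable number of resamples, so the actual agentos-bench method may differ. The 428-of-500 split is simply 85.6% of N=500, not the real per-case data.

```typescript
// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Interval {
  mean: number;
  lo: number;
  hi: number;
}

// Percentile bootstrap: resample cases with replacement, record each
// resample's accuracy, and read the interval off the sorted accuracies.
function bootstrapCi(
  outcomes: boolean[],
  resamples: number,
  seed: number,
  level = 0.95,
): Interval {
  const n = outcomes.length;
  if (n === 0) throw new Error("need at least one outcome");
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let correct = 0;
    for (let i = 0; i < n; i++) {
      if (outcomes[Math.floor(rand() * n)]) correct += 1;
    }
    means.push(correct / n);
  }
  means.sort((x, y) => x - y);
  const alpha = (1 - level) / 2;
  const loIdx = Math.floor(alpha * (resamples - 1));
  const hiIdx = Math.ceil((1 - alpha) * (resamples - 1));
  const mean = outcomes.filter(Boolean).length / n;
  return { mean, lo: means[loIdx], hi: means[hiIdx] };
}

// Synthetic outcomes: 428 correct out of 500 (85.6%).
const outcomes = Array.from({ length: 500 }, (_, i) => i < 428);
const ci = bootstrapCi(outcomes, 10000, 42);
console.log(ci.mean, ci.lo, ci.hi);
```

Fixing the seed means two runs over the same per-case outcomes report the same interval, which is what makes a published CI checkable against committed run files.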

Reproducing

The 85.6% LongMemEval-S headline:

git clone https://github.com/framersai/agentos-bench
cd agentos-bench
pnpm install && pnpm build

# Set OPENAI_API_KEY and COHERE_API_KEY in your environment
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-s \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000

The 70.2% LongMemEval-M headline (the single-variable change from the 57.6% configuration is --reader-top-k 5):

NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-m \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere --rerank-candidate-multiplier 5 \
  --reader-top-k 5 \
  --hyde \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000

Both runs ship with per-case run JSONs at seed=42. The full bench leaderboard is at packages/agentos-bench/results/LEADERBOARD.md.

Related blog posts

70.2% on LongMemEval-M (current M headline)
85.6% on LongMemEval-S Pareto-win (current S headline)
Memory Benchmark Transparency Audit (methodology framework)
Two Negative Results: Stage L + Stage I (what we tested and dropped)

References

Wu et al., LongMemEval (ICLR 2025). arXiv:2410.10813.
Maharana et al., LOCOMO (ACL 2024). aclanthology.org.
Penfield Labs LOCOMO audit (April 2026). dev.to/penfieldlabs.
Sumers et al., CoALA (cognitive architectures for language agents). arXiv:2309.02427.
Packer et al., MemGPT. arXiv:2310.08560. Now part of Letta.
Northcutt et al., Pervasive label errors in test sets (NeurIPS 2021). arXiv:2103.14749.
