Memory Benchmarks
AgentOS posts +1.4 points above Mastra at the matched gpt-4o reader on LongMemEval-S (85.6% vs Mastra's 84.23%), and is the only open-source library on the public record above 65% on the harder M variant.
This page is the canonical comparison table. Every cell links to its primary source. Cross-provider configurations (e.g. Mastra's gpt-5-mini reader + gemini-2.5-flash observer) are excluded because their results cannot be reproduced from public methodology disclosures.
TL;DR
- LongMemEval-S at full N=500, gpt-4o reader: AgentOS at 85.6% is +1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader. $0.0090 per correct, 3.6-second median latency.
- LongMemEval-M at full N=500, gpt-4o reader: AgentOS at 70.2% is competitive with the strongest published M results in the LongMemEval paper (Wu et al., ICLR 2025, Table 3). Against the paper's three primary GPT-4o configurations: +4.5 points above round-level Top-5 (65.7%), 1.2 below session-level Top-5 (71.4%), and 1.8 below round-level Top-10 (72.0%) while operating at the harder Top-5 retrieval budget. First open-source library above 65% on M with publicly reproducible methodology; the closest published external number is AgentBrain's 71.7%, from their closed-source SaaS.
- 15 adjacent stress-tested configurations all regress against the 85.6% headline. Locally Pareto-optimal in the tested parameter space.
LongMemEval-S Phase B (115K tokens, 50 sessions per haystack)
Same dataset (data/longmemeval/longmemeval_s.json), full N=500, same gpt-4o-2024-08-06 judge, same gpt-4o reader across every row.
| System | Accuracy | $/correct | p50 latency | Source |
|---|---|---|---|---|
| EmergenceMem Internal | 86.0% | not published | 5,650 ms | emergence.ai |
| 🚀 AgentOS canonical-hybrid + reader-router | 85.6% | $0.0090 | 3,558 ms | 85.6% Pareto-win post |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | mastra.ai |
| Supermemory gpt-4o | 81.6% | not published | not published | supermemory.ai |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | vendor reproduction adapter |
| Zep self-reported | 71.2% | not published | 632 ms p95 search | getzep.com |
| Zep independently reproduced | 63.8% | not published | not published | arXiv:2512.13564 |
At the matched gpt-4o reader, AgentOS's 85.6% is +1.4 points above Mastra OM gpt-4o (84.23%) and the highest published number from an open-source library that ships an end-to-end agent runtime around its memory system. EmergenceMem Internal posts 86.0% (0.4 points above), but at a higher published median latency: 5,650 ms vs AgentOS's 3,558 ms p50.
Cost at scale: at $0.0090 per correct, memory-grounded answers cost roughly $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
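The scaling arithmetic above can be sketched directly; this treats the published $/correct figure as a per-call approximation, and the call volumes are illustrative:

```python
# Back-of-envelope cost projection from the published $/correct figure.
# Treats $/correct as an approximate per-RAG-call cost (an assumption:
# it ignores the spend on incorrect answers).
COST_PER_CORRECT = 0.0090  # USD, from the LongMemEval-S table above

def projected_cost(rag_calls: int, cost_per_call: float = COST_PER_CORRECT) -> float:
    """Estimate USD spend for a given number of RAG calls."""
    return rag_calls * cost_per_call

# 1,000 RAG calls -> ~$9
print(f"${projected_cost(1_000):.2f}")
# 1,000 conversations x 5 RAG calls each -> ~$45
print(f"${projected_cost(1_000 * 5):.2f}")
```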
Why other Mastra and managed-platform numbers are not in this table
- Mastra OM 94.9% uses gpt-5-mini reader + gemini-2.5-flash observer (cross-provider). Their public methodology page does not include enough detail to reproduce the result; we cannot independently verify it.
- Mem0 v3 93.4% is a managed-platform number with no published CI, no judge model disclosure, no reader model disclosure. Their own State of AI Agent Memory 2026 post reports 66.9% on LOCOMO for their production stack, suggesting the 93.4% reflects the managed-evaluation harness more than the architecture.
- Hindsight 91.4% uses gemini-3-pro reader (cross-provider).
- Supermemory 85.2% uses gemini-3-pro reader (cross-provider).
- agentmemory 96.2% has no published CI and no methodology breakdown.
LongMemEval-M Phase B (1.5M tokens, 500 sessions per haystack)
The harder variant. M's haystacks exceed every production context window: GPT-4o is 128K, Claude Opus is 200K, Gemini 3 Pro is 1M. Most memory vendors stop at S because raw long-context fits there.
| System | Accuracy | License | Source |
|---|---|---|---|
| AgentBrain | 71.7% (Test 0) | closed-source SaaS, requires hosted endpoint | github.com/AgentBrainHQ |
| 🚀 AgentOS (sem-embed + reader-router + top-K=5) | 70.2% | Apache-2.0 | 70.2% post |
| LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo | Wu et al. ICLR 2025, Table 3 |
| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo | Wu et al. ICLR 2025, Table 3 |
| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo | Wu et al. ICLR 2025, Table 3 |
| Mem0 v3 | not published | Apache 2.0 | reports S only |
| Mastra OM | not published | Apache 2.0 | reports S only |
| Hindsight | not published | open repo | reports S only |
| Zep | not published | Apache 2.0 | "due to gpt-4o's 128K context window we chose S over M" |
| EmergenceMem | not published | open Python | reports S only |
| Supermemory | not published | open | reports S only |
| MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta, Cognee | not published | various | reports S only or no LongMemEval |
Competitive with the strongest published M results in the LongMemEval paper. At matched reader-Top-5 retrieval, AgentOS is +4.5 above the round-level configuration (65.7%) and 1.2 below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. AgentOS is the first open-source library above 65% on M with publicly reproducible methodology (per-case run JSONs at fixed seed, single-CLI reproduction). The closest published external number is AgentBrain's 71.7% from their closed-source SaaS.
The journey: 30.6% → 45.4% → 57.6% → 70.2%
| Date | Configuration | Aggregate | Lift |
|---|---|---|---|
| 2026-04-25 | Tier 1 canonical (CharHash, top-K=20) | 30.6% | baseline |
| 2026-04-26 | M-tuned (HyDE + top-K=50 + rerank-mult=5, CharHash) | 45.4% | +14.8 pp |
| 2026-04-29 | M-tuned + sem-embed + reader-router (top-K=50) | 57.6% | +12.2 pp |
| 2026-04-29 | M-tuned + sem-embed + reader-router + top-K=5 | 70.2% | +12.6 pp |
Cumulative: +39.6 pp over the original baseline. Each step has CIs disjoint from the prior step.
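The disjoint-CI claim can be checked with a standard percentile bootstrap over per-case correctness flags. This is a generic sketch, not the agentos-bench implementation (which uses `--bootstrap-resamples 10000`); the two sample vectors below just mirror the first two rows of the table at N=500:

```python
import random

def bootstrap_ci(flags, resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for accuracy over 0/1 per-case flags."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(resamples)
    )
    lo = stats[int(alpha / 2 * resamples)]
    hi = stats[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

# Two configurations are distinguishable when their 95% CIs are disjoint.
a = [1] * 153 + [0] * 347   # 30.6% of N=500
b = [1] * 227 + [0] * 273   # 45.4% of N=500
lo_a, hi_a = bootstrap_ci(a)
lo_b, hi_b = bootstrap_ci(b)
print(hi_a < lo_b)  # upper bound of a below lower bound of b
```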
LOCOMO (out-of-distribution transfer)
LongMemEval-tuned pipeline, no LOCOMO-specific tuning, gpt-4o reader, N=1986:
| Configuration | Accuracy | $/correct | Note |
|---|---|---|---|
| AgentOS K=20 retrieval (Pareto-best LOCOMO tuning) | 51.5% | $0.0099 | Stage F-3 |
| AgentOS Tier 1 OOD baseline | 49.9% | $0.0123 | no tuning |
| Mem0 self-reported (managed) | 66-68% | not published | LOCOMO with default gpt-4o-mini judge (Penfield FPR 62.81%) |
Judge FPR comparison (the variable that swings LOCOMO scores 30-60 pp):
| Benchmark | AgentOS judge FPR | LOCOMO default judge FPR |
|---|---|---|
| LongMemEval-S | 1% | not published |
| LongMemEval-M | 2% | not published |
| LOCOMO | 0% | 62.81% (Penfield Labs) |
The 62.81% FPR of LOCOMO's default gpt-4o-mini judge means any LOCOMO score above ~93.6% benefits from benchmark errors, and any score difference below ~6 pp sits in judge noise. AgentOS uses gpt-4o-2024-08-06 with rubric 2026-04-18.1, which an adversarial probe measures at 0% FPR on LOCOMO.
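One way to see how judge false positives inflate scores is a first-order model (an assumed model, not the benchmark harness): if the judge accepts a wrong answer with probability f, then observed = acc + (1 - acc) * f.

```python
def observed_score(true_acc: float, judge_fpr: float) -> float:
    """First-order inflation model: wrong answers pass the judge
    with probability judge_fpr, so they count as correct that often."""
    return true_acc + (1 - true_acc) * judge_fpr

# At the Penfield-measured 62.81% FPR of LOCOMO's default judge,
# a system with true 50% accuracy reads as ~81.4%:
print(observed_score(0.50, 0.6281))
# A judge probed at 0% FPR leaves the score uninflated:
print(observed_score(0.50, 0.0))
```

Under this model, sub-6-pp differences between systems scored by a high-FPR judge are easily an artifact of which wrong answers happened to slip through.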
Methodology disclosure (12 axes most vendors omit)
| Axis | AgentOS | Mem0 | Mastra | Supermemory | Zep | Emergence | Letta | MemPalace |
|---|---|---|---|---|---|---|---|---|
| Aggregate accuracy | yes | yes | yes | yes | yes | yes | partial | yes |
| 95% confidence interval | yes | no | no | no | partial | no | no | no |
| Per-category 95% interval | yes | no | no | no | no | no | no | no |
| Reader model disclosed | yes | no | yes | partial | yes | yes | no | no |
| Observer / ingest model disclosed | yes | no | yes | no | yes | yes | no | no |
| USD cost per correct | yes | no | no | no | no | no | no | no |
| Latency avg / p50 / p95 | yes | no | no | no | partial | median only | no | no |
| Per-category breakdown | yes | no | yes | yes | yes | yes | partial | no |
| Open-source benchmark runner | yes | yes | partial | yes | partial | yes | no | partial |
| Per-case run JSONs at fixed seed | yes | no | no | no | no | no | no | no |
| Judge-adversarial FPR probe | yes | no | no | no | no | no | no | no |
| Matched-reader cross-vendor table | yes | no | no | partial | partial | yes | no | no |
The full audit framework is at Memory Benchmark Transparency Audit. Per-case run JSONs at seed=42 are committed under packages/agentos-bench/results/runs/ for every published number.
Reproducing
The 85.6% LongMemEval-S headline:
git clone https://github.com/framersai/agentos-bench
cd agentos-bench
pnpm install && pnpm build
# Set OPENAI_API_KEY and COHERE_API_KEY in your environment
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-s \
--reader gpt-4o \
--memory full-cognitive --replay ingest \
--hybrid-retrieval --rerank cohere \
--embedder-model text-embedding-3-small \
--reader-router min-cost-best-cat-2026-04-28 \
--concurrency 5 \
--bootstrap-resamples 10000
The 70.2% LongMemEval-M headline (single-variable change is --reader-top-k 5):
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-m \
--reader gpt-4o \
--memory full-cognitive --replay ingest \
--hybrid-retrieval --rerank cohere --rerank-candidate-multiplier 5 \
--reader-top-k 5 \
--hyde \
--embedder-model text-embedding-3-small \
--reader-router min-cost-best-cat-2026-04-28 \
--concurrency 5 \
--bootstrap-resamples 10000
Both runs ship with per-case run JSONs at seed=42. The full bench leaderboard is at packages/agentos-bench/results/LEADERBOARD.md.
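The per-case run JSONs make any headline number re-checkable offline. The sketch below re-aggregates accuracy from a directory of run files; the field name `correct` is an assumption for illustration, not the documented agentos-bench schema, so check the actual files before relying on it:

```python
import json
from pathlib import Path

def recompute_accuracy(run_dir: str) -> float:
    """Re-aggregate accuracy from per-case run JSONs.

    Assumes one JSON file per case containing a boolean `correct`
    field (hypothetical schema -- verify against the real run files
    under packages/agentos-bench/results/runs/).
    """
    flags = [
        bool(json.loads(path.read_text())["correct"])
        for path in sorted(Path(run_dir).glob("*.json"))
    ]
    return sum(flags) / len(flags)
```

Because the runs are committed at a fixed seed, recomputing from these files should reproduce the published aggregate exactly rather than merely approximately.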
Related blog posts
- 70.2% on LongMemEval-M — current M headline
- 85.6% on LongMemEval-S Pareto-win — current S headline
- Memory Benchmark Transparency Audit — methodology framework
- Two Negative Results: Stage L + Stage I — what we tested and dropped
References
- Wu et al., LongMemEval (ICLR 2025). arXiv:2410.10813.
- Maharana et al., LOCOMO (ACL 2024). aclanthology.org.
- Penfield Labs LOCOMO audit (April 2026). dev.to/penfieldlabs.
- Sumers et al., CoALA (cognitive architectures for language agents). arXiv:2309.02427.
- Packer et al., MemGPT. arXiv:2310.08560. Now part of Letta.
- Northcutt et al., Pervasive label errors in test sets (NeurIPS 2021). arXiv:2103.14749.