Cosmo memory

Where we actually are, what the field has figured out, and what to build next. Fri 24 Apr 2026.

Contents
  1. Current state: the code isn't broken, the pattern is
  2. Evidence that vector-RAG memory fails in practice
  3. Karpathy's position (the big one)
  4. What production systems actually do
  5. Proposed Cosmo architecture
  6. Open decisions
  7. Validation notes

1. Current state: the code isn't broken, the pattern is

Audit findings:

```mermaid
flowchart TD
    U[User via Telegram] --> B[cosmo-bot]
    B --> A[cosmo-agent]
    A -. search on NEW session only .-> M[memory.js]
    M -. connect .-> Q[("Qdrant localhost:6333<br/>NOT RUNNING")]
    M -. embed .-> O[OpenAI embeddings]
    A --> C[Claude Agent SDK]
    C -. emits STORE_MEMORY directive .-> A
    A -. extractFactsWithSkill imported but never called .-> X[ ]
    style Q fill:#2a1818,stroke:#ff6b6b,color:#ff9a9a
    style X fill:none,stroke:none,color:none
```
The real problem isn't Qdrant being down. Flipping it on would just start producing junk. The Mem0 pattern (LLM extracts facts, embeds, vector-searches) has been independently measured at ~97.8% junk rate in production [1]. We'd be turning on a broken pattern.

2. Evidence that vector-RAG memory fails in practice

97.8% of Mem0 entries were junk after a 32-day production audit (10,134 entries). Breakdown: 52.7% system-prompt re-extractions, 11.5% cron/heartbeat noise, 5.2% hallucinated profiles, plus 808 copies of one feedback-loop hallucination.

Upgrading the extractor from Gemma 2B to Claude Sonnet didn't fix it (junk rate 97% → 89.6%). Stronger models extract more indiscriminately. [1]
+5.5pp on the LOCOMO benchmark: GPT-4o-mini with a plain filesystem (grep, open, search_files) scores 74.0% vs Mem0's graph variant at 68.5%. The dumb filesystem beats the specialised memory layer.

Caveat: benchmark run by Letta, who competes with Mem0. Treat as directional. [2]

Across everything we surveyed, the same five patterns showed up in systems that actually work:

  1. Filesystem > vector DB as the primitive. LLMs wield files fluently because the training data is full of filesystems. They do not wield vector stores fluently.
  2. Write-time categorisation beats read-time similarity. Forcing a bucket choice on the way in (Claude Code's user/feedback/project/reference, Skills in OpenClaw, profiles in LangGraph) is what prevents the junk pile.
  3. Sleep-time consolidation. Scheduled background pass dedupes, resolves contradictions, rewrites relative dates to absolute, prunes. Letta's sleep-time compute research shows ~5× reduction in test-time compute and up to 18% accuracy lift on AIME. [3]
  4. Bi-temporal facts (Zep/Graphiti). Every fact gets valid_from / valid_to. "I moved to Adelaide" doesn't nuke "I lived in Melbourne before" — the old fact just gets invalidated.
  5. Stop retrieving, start injecting. ChatGPT's memory feature has no vector DB. Four blocks get injected on every request. With 1M context + prompt caching, retrieval is a solution to a problem you shouldn't have (too much junk). [4]
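Pattern 4 is simple enough to show concretely. A minimal sketch of bi-temporal invalidation, with a fact shape of our own invention (not the actual Zep/Graphiti schema):

```javascript
// Bi-temporal update: never overwrite, just close the validity window of the
// old fact and open a new one. Fact shape here is illustrative only.
function updateFact(facts, subject, predicate, value, now) {
  // Invalidate the currently-valid fact for this (subject, predicate) pair.
  for (const f of facts) {
    if (f.subject === subject && f.predicate === predicate && f.validTo === null) {
      f.validTo = now;
    }
  }
  facts.push({ subject, predicate, value, validFrom: now, validTo: null });
  return facts;
}

const facts = [];
updateFact(facts, "user", "lives_in", "Melbourne", "2020-01-01");
updateFact(facts, "user", "lives_in", "Adelaide", "2026-04-01");
// The Melbourne fact survives with validTo set; Adelaide is the open fact.
```

"I moved to Adelaide" closes the Melbourne window instead of deleting it, so "where did I live before?" stays answerable.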

3. Karpathy's position (the big one)

This is the single most important finding. Karpathy has been public and consistent about memory for personal agents. Three weeks ago (Fri 3 Apr 2026) he published his own working architecture.

LLMs are a bit like a coworker with anterograde amnesia. They don't consolidate or build long-running knowledge or expertise once training is over and all they have is short-term memory (context window). It's hard to build relationships (see: 50 First Dates) or do work (see: Memento) with this condition. — Andrej Karpathy, X, 4 Jun 2025
+1 for "context engineering" over "prompt engineering". In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. — Andrej Karpathy, X, 25 Jun 2025
These models don't really have a distillation phase of taking what happened, analyzing it obsessively, thinking through it, doing some synthetic data generation process and distilling it back into the weights… I'd love to have them have less memory so that they have to look things up, and they only maintain the algorithms for thought. — Karpathy on Dwarkesh Patel podcast, 17 Oct 2025

The LLM Wiki (his current, published answer)

Published as a gist on Fri 3 Apr 2026 [5]. His actual working memory architecture for personal AI use. Three layers:

```mermaid
flowchart TD
    R["Layer 1<br/>Raw sources<br/>conversations, notes,<br/>emails, clips"]
    S["Layer 3<br/>SCHEMA.md<br/>rules for maintaining<br/>the wiki"]
    W["Layer 2<br/>The Wiki<br/>markdown pages per topic<br/>LLM owns this"]
    Q[query]
    L["lint pass<br/>contradictions,<br/>stale, orphans"]
    R -->|ingest| W
    S -. guides .-> W
    Q --> W
    L --> W
    style W fill:#1b2d1e,stroke:#7bd88f
```

Operations:

Karpathy's explicit framing: no vector databases, no RAG pipelines, just markdown + an LLM acting as a full-time librarian. He argues RAG re-derives knowledge on every query — stateless, amnesiac, wasteful. A wiki does the cognitive lift once, at ingestion, and keeps a structured interlinked artifact. VentureBeat coverage.
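The lint pass reduces to a scan over the wiki pages. A minimal sketch, with orphan-link and relative-date heuristics that are our own guesses rather than anything from the gist:

```javascript
// Lint a set of markdown pages: flag [[wiki-links]] to missing pages and
// relative-date phrases that will go stale. Heuristics are illustrative.
function lintWiki(pages) {
  const names = new Set(Object.keys(pages));
  const findings = [];
  for (const [name, text] of Object.entries(pages)) {
    for (const m of text.matchAll(/\[\[([^\]]+)\]\]/g)) {
      if (!names.has(m[1])) {
        findings.push({ page: name, kind: "orphan-link", detail: m[1] });
      }
    }
    if (/\b(yesterday|last week|recently)\b/i.test(text)) {
      findings.push({ page: name, kind: "relative-date", detail: "rewrite to absolute" });
    }
  }
  return findings;
}

const findings = lintWiki({
  "health.md": "Saw the physio [[clinics.md]] recently.",
  "work.md": "Shipped v2 on 2026-04-20.",
});
// health.md gets flagged twice: one orphan link, one relative date.
```

In the real wiki this would be one scheduled LLM pass, not regexes; the point is that lint findings are a work queue for the librarian, not an error.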

4. What production systems actually do

| System | Primitive | Read path | Stale facts | Verdict |
| --- | --- | --- | --- | --- |
| ChatGPT memory | Text blocks injected every prompt | No retrieval — just stuff context | LLM rewrite on update | works, opaque |
| Claude Code auto-memory | MEMORY.md index + topic files | Index always loaded, files on-demand | Write-time dedup + future Auto Dream | works, inspectable |
| Claude API Memory Tool | /memories/ filesystem, CRUD | Agent chooses via tool calls | Agent-managed | primitive, flexible |
| Mem0 | Vector + graph + KV hybrid | Embedding similarity | LLM UPDATE vs ADD | 97.8% junk in practice |
| Letta / MemGPT | Core / recall / archival tiers, OR filesystem | Agent pages between tiers | Agent rewrites core | filesystem variant wins |
| Zep / Graphiti | Bi-temporal knowledge graph | Hybrid graph + semantic | Invalidate old edges, don't overwrite | cleanest staleness model |
| Karpathy's LLM Wiki | Markdown wiki, LLM-curated | LLM reads pages by filename | Lint pass | what a smart person uses |
| Claude Code Auto Dream (unshipped) | Background consolidation pass | n/a — it's a write-time op | Orient → Gather → Consolidate → Prune | flag gated, reimplementable |

5. Proposed Cosmo architecture

The convergence across Karpathy, Claude Code, ChatGPT, and Letta's benchmarks is striking. Everyone keeps arriving at: markdown files + LLM librarian + scheduled consolidation. That's what we should build.

Storage layout

```
~/cosmo-memory/
├── INDEX.md                    # always loaded, <200 lines, the map
├── SCHEMA.md                   # rules the librarian follows
├── topics/
│   ├── user_profile.md         # stable facts: name, city, preferences
│   ├── health.md               # → links to .claude/skills/health
│   ├── work_humankind.md
│   ├── work_h2os.md
│   ├── projects_cosmo.md
│   ├── projects_activism.md
│   ├── relationships.md
│   └── ...                     # one file per meaningful life domain
└── episodes/
    └── 2026-04/                # monthly folders of raw episodes
        ├── 2026-04-24-17-45.md # one per significant conversation
        └── ...
```

Read path (every turn)

```mermaid
flowchart TD
    U[User message] --> B[cosmo-bot]
    B --> A[cosmo-agent]
    A -->|always inject| I["INDEX.md<br/>~200 lines"]
    A -->|always inject| P[user_profile.md]
    A -->|router LLM picks 1-3| T["topics/*.md"]
    I --> C[Claude]
    P --> C
    T --> C
    C --> R[Response to user]
    style I fill:#1b2d1e,stroke:#7bd88f
    style P fill:#1b2d1e,stroke:#7bd88f
```

INDEX + profile are cheap (< 3k tokens combined) and always-warm in prompt cache. Router LLM (cheap Haiku call) picks 1-3 topic files based on the user's message. No embeddings. No vector search.
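The read path is mostly string assembly. A sketch of that assembly, with the file reader and router injected as stubs (in Cosmo the reader would hit ~/cosmo-memory/ and the router would be the Haiku call):

```javascript
// Build the per-turn context: always-injected INDEX + profile, plus 1-3
// routed topic files. readFile and routeTopics are injected for testability.
function buildContext(userMessage, readFile, routeTopics) {
  const always = [readFile("INDEX.md"), readFile("topics/user_profile.md")];
  const picked = routeTopics(userMessage).slice(0, 3); // router returns file names
  const topics = picked.map((name) => readFile(`topics/${name}`));
  return [...always, ...topics].join("\n\n---\n\n");
}

// Example with stubbed files and a stubbed router:
const files = {
  "INDEX.md": "# Index",
  "topics/user_profile.md": "Name: (profile)",
  "topics/health.md": "Physio notes",
};
const ctx = buildContext("how's my knee?", (p) => files[p], () => ["health.md"]);
// ctx starts with the INDEX and ends with the routed health topic.
```

Keeping INDEX + profile first and byte-stable is what makes the prompt-cache hit rate work.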

Write path (end of turn)

```mermaid
flowchart TD
    R[Full turn exchange] --> E["Extractor<br/>(Haiku, structured output)"]
    E -->|nothing notable| X["skip - don't save"]
    E -->|notable fact| D{Decide}
    D -->|new fact| N["Append to<br/>episodes/YYYY-MM/*.md"]
    D -->|updates existing| U["Edit topic file<br/>(str_replace one line)"]
    N --> Q["flagged for<br/>next dream pass"]
    U --> L[done]
    style X fill:#1a1a1a,stroke:#555
    style L fill:#1b2d1e,stroke:#7bd88f
```

Key difference from current: extractor must pick a bucket at write time. No uncategorised blobs. If it can't confidently pick, the fact goes to episodes and waits for the dream pass.
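The bucket rule can be pinned down as a tiny routing function. A sketch under assumed field names (notable, topicFile, updatesLine are illustrative, not an existing schema):

```javascript
// Route one extracted fact: drop it, edit a topic file in place, or append
// to the current month's episodes for the dream pass to sort out later.
function routeFact(fact, now = new Date()) {
  if (!fact.notable) return null; // nothing notable: skip, don't save
  if (fact.topicFile && fact.updatesLine) {
    // Confident bucket + existing line: str_replace-style edit in the topic file.
    return { op: "edit", file: `topics/${fact.topicFile}`, replace: fact.updatesLine };
  }
  // New or uncertain: append to an episode file, flagged for the dream pass.
  const month = now.toISOString().slice(0, 7); // YYYY-MM
  const day = now.toISOString().slice(0, 10);  // YYYY-MM-DD
  return { op: "append", file: `episodes/${month}/${day}.md` };
}
```

The invariant is that no path produces an uncategorised blob: every write lands in a named topic file or a dated episode file.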

Dream pass (nightly, PM2 cron)

```mermaid
flowchart TD
    S["Scheduler<br/>daily 3am"] --> O["ORIENT<br/>read INDEX +<br/>last 24h episodes"]
    O --> G["GATHER<br/>corrections<br/>repeated patterns<br/>relative dates"]
    G --> CO["CONSOLIDATE<br/>promote episodes<br/>resolve contradictions<br/>absolute dates"]
    CO --> PR["PRUNE + INDEX<br/>cap INDEX at 200 lines<br/>demote low-signal<br/>regenerate INDEX"]
    PR --> DN["Done<br/>Telegram digest<br/>to user"]
    style DN fill:#1b2d1e,stroke:#7bd88f
```

This is where staleness and junk get cleaned. Same four phases as Auto Dream [6]. Runs as its own PM2 process (or a cron-triggered Claude Agent SDK call). Telegram digest every morning: "I learned 3 things yesterday, updated 2 facts, flagged 1 contradiction. Review?"
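The four phases chain into one pipeline. A skeleton with each phase stubbed where the real implementation would make an LLM call (shown synchronous for clarity; the phase names follow [6], the plumbing is ours):

```javascript
// Dream pass skeleton: each phase is injected so the pipeline shape is
// testable independently of the LLM calls that would implement the phases.
function dreamPass(phases) {
  const state = phases.orient();                 // read INDEX + last 24h episodes
  const candidates = phases.gather(state);       // corrections, patterns, relative dates
  const changes = phases.consolidate(candidates); // promote, resolve, absolutise dates
  const summary = phases.prune(changes);         // cap INDEX, demote, regenerate
  phases.notify(summary);                        // morning Telegram digest
  return summary;
}
```

A PM2 cron (or cron-triggered Agent SDK call) would invoke this nightly; the digest it returns is exactly the "I learned 3 things yesterday" message.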

What we delete and why

Everything on the vector path from the audit diagram: the Qdrant dependency (not running, and turning it on would reproduce the Mem0 junk pattern), the OpenAI embedding calls, the STORE_MEMORY directive, and the never-called extractFactsWithSkill import.

What we keep

The Telegram plumbing: cosmo-bot → cosmo-agent → Claude Agent SDK. That pipeline works; only the memory layer behind it changes.

6. Open decisions

  1. Storage location. Disk at ~/cosmo-memory/ (git-versionable, Dropbox-syncable, inspectable) vs Firestore (queryable from web UI, multi-device). I lean: disk as source of truth, nightly sync to Firestore for read-only web access.
  2. Router model. Use Haiku for the topic-file router? (Fast, cheap, good enough.) Or skip router and inject all topic files up to a token budget? With prompt caching the second option is cheaper than it sounds.
  3. Bi-temporal rigour. Do we really want valid_from/valid_to on every fact, or only for the things that change (location, relationships, jobs, phase of training)? I lean: opt-in per fact class.
  4. Dream cadence. Nightly at 3am is the obvious default. But we should probably also trigger a dream pass on /clear since that's when you're explicitly asking for a reset.
  5. Migration. We have zero meaningful data in Qdrant (it's not running). So there's nothing to migrate. We can start fresh.
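Decision 2's no-router option turns injection into a packing problem: include topic files in priority order until a token budget is exhausted. A sketch using a rough chars/4 token estimate rather than a real tokenizer:

```javascript
// Greedily pack topic files under a token budget. Files are assumed
// pre-sorted by priority (e.g. recency); chars/4 is a crude estimate.
function packTopics(files, budgetTokens) {
  const estTokens = (text) => Math.ceil(text.length / 4);
  const included = [];
  let used = 0;
  for (const f of files) {
    const cost = estTokens(f.text);
    if (used + cost > budgetTokens) continue; // skip oversized, keep trying smaller
    included.push(f.name);
    used += cost;
  }
  return { included, used };
}

const out = packTopics(
  [
    { name: "user_profile.md", text: "x".repeat(400) },  // ~100 tokens
    { name: "health.md", text: "x".repeat(4000) },       // ~1000 tokens
    { name: "relationships.md", text: "x".repeat(400) }, // ~100 tokens
  ],
  300
);
// → includes the two small files (~200 tokens), skips health.md
```

With prompt caching, the budget can be generous: a stable packing order means the packed prefix stays cache-warm across turns.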

Validation notes

The research behind this page was independently validated by a second agent. All seven key claims traced to real primary sources. Minor caveats to note:

  1. Mem0 97.8% junk. verified GitHub issue mem0ai/mem0#4573. Exact numbers match. One earlier overstatement corrected: the Sonnet upgrade "didn't fix it" (junk dropped 97% → 89.6%), it did not "make it worse".
  2. Letta 74% vs Mem0 68.5% on LOCOMO. verified Letta blog. Benchmark run by Letta themselves, so treat as directional.
  3. Sleep-time Compute. verified with correction arXiv 2504.13171. Actual numbers: ~5× reduction in test-time compute (GSM-Symbolic + AIME), 2.5× cost-per-query reduction (a different metric), and accuracy lifts of up to 13% on GSM-Symbolic and 18% on AIME. The "2.5× or 5×" phrasing in my earlier summary conflated the two metrics.
  4. ChatGPT memory = four blocks, no RAG. reverse-engineered, not OpenAI-confirmed Gupta, Khemani. Both authors explicitly disclose they derived this by asking ChatGPT about itself. Credible but not official.
  5. Karpathy's LLM Wiki gist. verified gist, published Fri 3 Apr 2026. Covered by VentureBeat.
  6. Claude Code Auto Dream. verified, but reverse-engineered dream-skill, claude-code-secrets. Feature flag tengu_onyx_plover confirmed. Four phases confirmed. Anthropic has not officially announced this — found in npm sourcemap leak.
  7. OpenClaw. verified github.com/openclaw/openclaw — real, active project. Skills stored at ~/.openclaw/workspace/skills/<skill>/SKILL.md. "Skills-as-memory" is my framing, not theirs.

Sources pulled Fri 24 Apr 2026, ~16:00 ACST. Document saved at specs/research/memory-architecture-proposal.html.
