Cosmo memory
Where we actually are, what the field has figured out, and what to build next.
1. Current state: the code isn't broken, the pattern is
Audit findings:
- `src/memory.js` is 469 lines, complete. Qdrant client, OpenAI embeddings (text-embedding-3-small, 1536 dim), dedup at 0.92 similarity, sensitive-data filter.
- Wired into `src/agent.js:717-733` (read) and `:1474-1493` (write via `[STORE_MEMORY:]` directives).
- Qdrant isn't running on localhost:6333. No `QDRANT_URL` in `.env`. All calls fail silently inside try/catch.
- `extractFactsWithSkill` is imported at `agent.js:53` and never called.
- Memory search is skipped on resumed sessions (`agent.js:719`).
(Diagram, partially recoverable: memory writes go to Qdrant, which is NOT RUNNING; embeddings go to OpenAI; agent.js talks to the Claude Agent SDK, which emits STORE_MEMORY directives back to it; extractFactsWithSkill is imported but never called.)
2. Evidence that vector-RAG memory fails in practice
Mem0 in production: a user audit found 97.8% of stored memories were junk (GitHub issue mem0ai/mem0#4573). Upgrading the extractor from Gemma 2B to Claude Sonnet didn't fix it (junk rate 97% → 89.6%). Stronger models extract more indiscriminately. [1]
On the LOCOMO benchmark, Letta scored 74% vs Mem0's 68.5%. Caveat: benchmark run by Letta, who competes with Mem0. Treat as directional. [2]
Across everything we surveyed, the same five patterns showed up in systems that actually work:
- Filesystem > vector DB as the primitive. LLMs wield files fluently because the training data is full of filesystems. They do not wield vector stores fluently.
- Write-time categorisation beats read-time similarity. Forcing a bucket choice on the way in (Claude Code's user/feedback/project/reference, Skills in OpenClaw, profiles in LangGraph) is what prevents the junk pile.
- Sleep-time consolidation. Scheduled background pass dedupes, resolves contradictions, rewrites relative dates to absolute, prunes. Letta's sleep-time compute research shows ~5× reduction in test-time compute and up to 18% accuracy lift on AIME. [3]
- Bi-temporal facts (Zep/Graphiti). Every fact gets `valid_from`/`valid_to`. "I moved to Adelaide" doesn't nuke "I lived in Melbourne before" — the old fact just gets invalidated.
- Stop retrieving, start injecting. ChatGPT's memory feature has no vector DB. Four blocks get injected on every request. With 1M context + prompt caching, retrieval is a solution to a problem you shouldn't have (too much junk). [4]
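The bi-temporal pattern is small enough to sketch directly. A minimal version over a flat in-memory fact store; the `assertFact` helper and record shape are illustrative, not Zep/Graphiti's actual API:

```javascript
// Minimal bi-temporal fact store sketch (illustrative, not Zep/Graphiti's API).
// New facts never overwrite old ones; they close the old fact's validity window.
function assertFact(store, subject, predicate, value, now) {
  for (const f of store) {
    if (f.subject === subject && f.predicate === predicate && f.validTo === null) {
      f.validTo = now; // invalidate, don't delete: history survives
    }
  }
  store.push({ subject, predicate, value, validFrom: now, validTo: null });
  return store;
}

const facts = [];
assertFact(facts, "user", "city", "Melbourne", "2024-01-01");
assertFact(facts, "user", "city", "Adelaide", "2026-04-01");
// facts[0] is now closed (validTo set to "2026-04-01"); facts[1] is current.
```

The point is the write discipline: an update is always "close the old window, open a new one", so "where did the user live in 2025?" stays answerable.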
3. Karpathy's position (the big one)
This is the single most important finding. Karpathy has been public and consistent about memory for personal agents. Three weeks ago (Fri 3 Apr 2026) he published his own working architecture.
LLMs are a bit like a coworker with anterograde amnesia. They don't consolidate or build long-running knowledge or expertise once training is over and all they have is short-term memory (context window). It's hard to build relationships (see: 50 First Dates) or do work (see: Memento) with this condition. — Andrej Karpathy, X, 4 Jun 2025
+1 for "context engineering" over "prompt engineering". In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. — Andrej Karpathy, X, 25 Jun 2025
These models don't really have a distillation phase of taking what happened, analyzing it obsessively, thinking through it, doing some synthetic data generation process and distilling it back into the weights… I'd love to have them have less memory so that they have to look things up, and they only maintain the algorithms for thought. — Karpathy on Dwarkesh Patel podcast, 17 Oct 2025
The LLM Wiki (his current, published answer)
Published as a gist on Fri 3 Apr 2026 [5]. His actual working memory architecture for personal AI use. Three layers:
```mermaid
graph TD
    R["Raw sources<br/>conversations, notes,<br/>emails, clips"]
    S["Layer 3<br/>SCHEMA.md<br/>rules for maintaining<br/>the wiki"]
    W["Layer 2<br/>The Wiki<br/>markdown pages per topic<br/>LLM owns this"]
    Q[query]
    L["lint pass<br/>contradictions,<br/>stale, orphans"]
    R -->|ingest| W
    S -.guides.-> W
    Q --> W
    L --> W
    style W fill:#1b2d1e,stroke:#7bd88f
```
Operations:
- ingest — one source at a time, LLM updates relevant wiki pages
- query — synthesise an answer from the wiki, file results back
- lint — scan for contradictions, stale claims, orphans
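Of the three, lint is the most mechanical; the orphans check, for instance, is just link-graph bookkeeping. A toy sketch (the `findOrphans` name and the filename-to-body map are my assumptions, not Karpathy's code):

```javascript
// Toy orphan check for a markdown wiki: pages that no other page links to.
// `pages` maps filename -> markdown body; `root` is the entry page.
function findOrphans(pages, root = "INDEX.md") {
  const linked = new Set();
  for (const body of Object.values(pages)) {
    // Collect every markdown link target ending in .md
    for (const m of body.matchAll(/\]\(([^)]+\.md)\)/g)) linked.add(m[1]);
  }
  // The root page is reachable by definition, so it is never an orphan.
  return Object.keys(pages).filter(p => p !== root && !linked.has(p));
}
```

The contradiction and staleness checks need an LLM pass; this one doesn't, which is why it's worth running on every dream cycle for free.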
4. What production systems actually do
| System | Primitive | Read path | Stale facts | Verdict |
|---|---|---|---|---|
| ChatGPT memory | Text blocks injected every prompt | No retrieval — just stuff context | LLM rewrite on update | works, opaque |
| Claude Code auto-memory | MEMORY.md index + topic files | Index always loaded, files on-demand | Write-time dedup + future Auto Dream | works, inspectable |
| Claude API Memory Tool | /memories/ filesystem, CRUD | Agent chooses via tool calls | Agent-managed | primitive, flexible |
| Mem0 | Vector + graph + KV hybrid | Embedding similarity | LLM UPDATE vs ADD | 97.8% junk in practice |
| Letta / MemGPT | Core / recall / archival tiers, OR filesystem | Agent pages between tiers | Agent rewrites core | filesystem variant wins |
| Zep / Graphiti | Bi-temporal knowledge graph | Hybrid graph + semantic | Invalidate old edges, don't overwrite | cleanest staleness model |
| Karpathy's LLM Wiki | Markdown wiki, LLM-curated | LLM reads pages by filename | Lint pass | what a smart person uses |
| Claude Code Auto Dream (unshipped) | Background consolidation pass | n/a — it's a write-time op | Orient → Gather → Consolidate → Prune | flag gated, reimplementable |
5. Proposed Cosmo architecture
The convergence across Karpathy, Claude Code, ChatGPT, and Letta's benchmarks is striking. Everyone keeps arriving at: markdown files + LLM librarian + scheduled consolidation. That's what we should build.
Storage layout
~/cosmo-memory/
├── INDEX.md # always loaded, <200 lines, the map
├── SCHEMA.md # rules the librarian follows
├── topics/
│ ├── user_profile.md # stable facts: name, city, preferences
│ ├── health.md # → links to .claude/skills/health
│ ├── work_humankind.md
│ ├── work_h2os.md
│ ├── projects_cosmo.md
│ ├── projects_activism.md
│ ├── relationships.md
│ └── ... # one file per meaningful life domain
└── episodes/
└── 2026-04/ # monthly folders of raw episodes
├── 2026-04-24-17-45.md # one per significant conversation
└── ...
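The episode filename convention above is trivial but worth pinning down once. A sketch (the `episodePath` name is mine, and using UTC is an assumption, not a requirement of the design):

```javascript
// Derive an episode file path from a timestamp, matching the layout above:
// episodes/<YYYY-MM>/<YYYY-MM-DD-HH-MM>.md. Uses UTC (assumption).
function episodePath(date) {
  const [day, time] = date.toISOString().split("T"); // "2026-04-24", "17:45:00.000Z"
  const month = day.slice(0, 7);                     // monthly folder, "2026-04"
  const stamp = `${day}-${time.slice(0, 5).replace(":", "-")}`;
  return `episodes/${month}/${stamp}.md`;
}

episodePath(new Date("2026-04-24T17:45:00Z"));
// "episodes/2026-04/2026-04-24-17-45.md"
```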
Read path (every turn)
(Diagram: INDEX.md (~200 lines) and user_profile.md are always injected; a router LLM picks 1-3 files from topics/*.md; all of it goes to Claude, which produces the response to the user.)
INDEX + profile are cheap (< 3k tokens combined) and always-warm in prompt cache. Router LLM (cheap Haiku call) picks 1-3 topic files based on the user's message. No embeddings. No vector search.
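The router call itself is one cheap completion; the part worth sketching is the guard around its reply, since a model can name files that don't exist. A sketch (the function name, the reply format, and the cap of 3 are assumptions from the design above):

```javascript
// Clamp a router model's reply to real topic files, max 3.
// Assumes the router was asked to reply with comma- or newline-separated filenames.
function pickTopicFiles(modelReply, availableTopics) {
  const picked = modelReply
    .split(/[,\n]/)
    .map(s => s.trim())
    .filter(name => availableTopics.includes(name)); // drop hallucinated files
  return [...new Set(picked)].slice(0, 3);           // dedupe, cap at 3
}

pickTopicFiles("health.md, projects_cosmo.md\nnot_a_file.md",
               ["health.md", "projects_cosmo.md", "relationships.md"]);
// → ["health.md", "projects_cosmo.md"]
```

Worst case, the router picks nothing useful and the turn runs on INDEX + profile alone, which is a graceful floor.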
Write path (end of turn)
(Diagram: an extractor (Haiku, structured output) evaluates the turn. Nothing notable → skip, don't save. A new fact → append to episodes/YYYY-MM/*.md, flagged for the next dream pass. An update to an existing fact → edit the topic file with a one-line str_replace, done.)
Key difference from current: extractor must pick a bucket at write time. No uncategorised blobs. If it can't confidently pick, the fact goes to episodes and waits for the dream pass.
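The bucket rule can be enforced in code rather than prompt alone. A sketch of the routing decision (the extraction shape and the 0.8 threshold are assumptions to be tuned):

```javascript
// Route one extracted fact: skip, edit a topic file, or park it in episodes.
// `extraction` is the extractor's structured output (hypothetical shape):
//   { notable: boolean, fact: string, bucket: string|null, confidence: number }
function routeFact(extraction, topicBuckets) {
  if (!extraction.notable) return { action: "skip" };
  const confident =
    extraction.bucket !== null &&
    topicBuckets.includes(extraction.bucket) &&
    extraction.confidence >= 0.8; // threshold is a guess, tune it
  return confident
    ? { action: "edit-topic", file: `topics/${extraction.bucket}.md` }
    : { action: "append-episode" }; // uncategorised: wait for the dream pass
}
```

Note the failure mode is deliberately soft: a low-confidence fact isn't dropped, it lands in episodes where the dream pass can categorise it with more context.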
Dream pass (nightly, PM2 cron)
(Diagram: daily 3am trigger → ORIENT: read INDEX + the last 24h of episodes → GATHER: corrections, repeated patterns, relative dates → CONSOLIDATE: promote episodes, resolve contradictions, rewrite to absolute dates → PRUNE + INDEX: cap INDEX at 200 lines, demote low-signal entries, regenerate INDEX → done, Telegram digest to user.)
This is where staleness and junk get cleaned. Same four phases as Auto Dream [6]. Runs as its own PM2 process (or a cron-triggered Claude Agent SDK call). Telegram digest every morning: "I learned 3 things yesterday, updated 2 facts, flagged 1 contradiction. Review?"
What we delete and why
- Qdrant. No semantic search. Files only.
- OpenAI embedding calls. Not needed.
- The `[STORE_MEMORY:]` directive pattern. Replace with a structured extractor sub-agent.
What we keep
- Skills at `.claude/skills/` (procedural memory, already good).
- Firestore messages collection (raw event log, searchable backup).
- The `/memories` Telegram command (now shows INDEX.md + the topic file list).
6. Open decisions
- Storage location. Disk at `~/cosmo-memory/` (git-versionable, Dropbox-syncable, inspectable) vs Firestore (queryable from web UI, multi-device). I lean: disk as source of truth, nightly sync to Firestore for read-only web access.
- Router model. Use Haiku for the topic-file router? (Fast, cheap, good enough.) Or skip the router and inject all topic files up to a token budget? With prompt caching the second option is cheaper than it sounds.
- Bi-temporal rigour. Do we really want `valid_from`/`valid_to` on every fact, or only for the things that change (location, relationships, jobs, phase of training)? I lean: opt-in per fact class.
- Dream cadence. Nightly at 3am is the obvious default. But we should probably also trigger a dream pass on `/clear`, since that's when you're explicitly asking for a reset.
- Migration. We have zero meaningful data in Qdrant (it's not running), so there's nothing to migrate. We can start fresh.
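If we drop the router, the "inject everything up to a budget" option is a few lines. A sketch (the function name is mine, and ~4 characters per token is a rough heuristic, not a real tokenizer):

```javascript
// Inject topic files in priority order until an approximate token budget is hit.
// Assumes ~4 characters per token, which is a rough heuristic only.
function injectUpToBudget(files, budgetTokens, charsPerToken = 4) {
  const chosen = [];
  let used = 0;
  for (const f of files) { // files assumed pre-sorted by priority
    const cost = Math.ceil(f.content.length / charsPerToken);
    if (used + cost > budgetTokens) break;
    chosen.push(f.name);
    used += cost;
  }
  return chosen;
}
```

With prompt caching, the stable prefix (INDEX + profile + high-priority topics) stays warm, so the marginal cost of the extra files is mostly one-time.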
Validation notes
The research behind this page was independently validated by a second agent. All seven key claims traced to real primary sources. Minor caveats to note:
- Mem0 97.8% junk. verified GitHub issue mem0ai/mem0#4573. Exact numbers match. Slight overstatement: Sonnet upgrade "didn't fix" rather than "made it worse" (junk dropped 97% → 89.6%).
- Letta 74% vs Mem0 68.5% on LOCOMO. verified Letta blog. Benchmark run by Letta themselves, so treat as directional.
- Sleep-time Compute. verified with correction arXiv 2504.13171. Actual numbers: ~5× reduction in test-time compute (GSM-Symbolic + AIME), 2.5× cost-per-query reduction (different metric), up to 13% accuracy on GSM-Symbolic and 18% on AIME. The "2.5× or 5×" phrasing in my earlier summary conflated two different metrics.
- ChatGPT memory = four blocks, no RAG. reverse-engineered, not OpenAI-confirmed (Gupta, Khemani). Both authors explicitly disclose they derived this by asking ChatGPT about itself. Credible but not official.
- Karpathy's LLM Wiki gist. verified gist, published Fri 3 Apr 2026. Covered by VentureBeat.
- Claude Code Auto Dream. verified, but reverse-engineered (dream-skill, claude-code-secrets). Feature flag `tengu_onyx_plover` confirmed. Four phases confirmed. Anthropic has not officially announced this — found in an npm sourcemap leak.
- OpenClaw. verified github.com/openclaw/openclaw — real, active project. Skills stored at `~/.openclaw/workspace/skills/<skill>/SKILL.md`. "Skills-as-memory" is my framing, not theirs.