Cosmo memory
Where we actually are, what the field has figured out, and what to build next.
1. Current state: the code isn't broken, the pattern is
Audit findings:
- `src/memory.js` is 469 lines, complete. Qdrant client, OpenAI embeddings (text-embedding-3-small, 1536 dim), dedup at 0.92 similarity, sensitive-data filter.
- Wired into `src/agent.js:717-733` (read) and `:1474-1493` (write via `[STORE_MEMORY:]` directives).
- Qdrant isn't running on localhost:6333. No `QDRANT_URL` in `.env`. All calls fail silently inside try/catch.
- `extractFactsWithSkill` is imported at `agent.js:53` and never called.
- Memory search skipped on resumed sessions (`agent.js:719`).
[Diagram: current wiring. `src/memory.js` talks to Qdrant (NOT RUNNING) and OpenAI embeddings; `src/agent.js` talks to the Claude Agent SDK, which emits the `STORE_MEMORY` directive; `extractFactsWithSkill` is imported but never called.]
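The silent-failure mode from the audit is worth pinning down, since it is why a dead Qdrant went unnoticed. A minimal sketch (function names are illustrative, not the actual `src/memory.js` code): a bare `catch` makes "Qdrant is down" indistinguishable from "no results".

```javascript
// Illustrative reproduction of the audit's failure mode: a bare try/catch
// swallows the connection error when Qdrant is not running, so the agent
// quietly proceeds with no memory at all.
async function searchMemorySilent(query, client) {
  try {
    return await client.search(query); // throws ECONNREFUSED when Qdrant is down
  } catch {
    return []; // failure is indistinguishable from "no results"
  }
}

// A minimal fix: surface the failure instead of masking it.
async function searchMemoryLoud(query, client) {
  try {
    return { results: await client.search(query), degraded: false };
  } catch (err) {
    console.error(`memory search failed: ${err.message}`);
    return { results: [], degraded: true }; // caller can tell memory is offline
  }
}
```

Whatever replaces the vector path should keep the loud variant's shape: degraded memory is a state the agent can see, not an empty array.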
2. Evidence that vector-RAG memory fails in practice
Upgrading the extractor from Gemma 2B to Claude Sonnet didn't fix it (junk rate 97% → 89.6%). Stronger models extract more indiscriminately. [1]
Caveat: benchmark run by Letta, who competes with Mem0. Treat as directional. [2]
Across everything we surveyed, the same five patterns showed up in systems that actually work:
- Filesystem > vector DB as the primitive. LLMs wield files fluently because the training data is full of filesystems. They do not wield vector stores fluently.
- Write-time categorisation beats read-time similarity. Forcing a bucket choice on the way in (Claude Code's user/feedback/project/reference, Skills in OpenClaw, profiles in LangGraph) is what prevents the junk pile.
- Sleep-time consolidation. Scheduled background pass dedupes, resolves contradictions, rewrites relative dates to absolute, prunes. Letta's sleep-time compute research shows ~5× reduction in test-time compute and up to 18% accuracy lift on AIME. [3]
- Bi-temporal facts (Zep/Graphiti). Every fact gets `valid_from`/`valid_to`. "I moved to Adelaide" doesn't nuke "I lived in Melbourne before" — the old fact just gets invalidated.
- Stop retrieving, start injecting. ChatGPT's memory feature has no vector DB. Four blocks get injected on every request. With 1M context + prompt caching, retrieval is a solution to a problem you shouldn't have (too much junk). [4]
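The bi-temporal idea is small enough to sketch. This is an illustrative flat fact list, not Zep/Graphiti's actual API: an update closes the old fact's validity window instead of deleting it.

```javascript
// Illustrative bi-temporal fact store (not Zep/Graphiti's real API).
// Asserting a new value never deletes the old fact; it sets valid_to.
function assertFact(store, subject, predicate, value, now) {
  for (const fact of store) {
    if (fact.subject === subject && fact.predicate === predicate && fact.valid_to === null) {
      fact.valid_to = now; // invalidate, don't overwrite
    }
  }
  store.push({ subject, predicate, value, valid_from: now, valid_to: null });
}

const facts = [];
assertFact(facts, "user", "city", "Melbourne", "2024-01-01");
assertFact(facts, "user", "city", "Adelaide", "2026-04-01");
// Both facts survive: Melbourne now carries a valid_to, Adelaide stays open.
```

The payoff is that "where did the user live in 2025?" stays answerable after the move, which plain overwrite-on-update destroys.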
3. Karpathy's position (the big one)
This is the single most important finding. Karpathy has been public and consistent about memory for personal agents. Three weeks ago (Fri 3 Apr 2026) he published his own working architecture.
LLMs are a bit like a coworker with anterograde amnesia. They don't consolidate or build long-running knowledge or expertise once training is over and all they have is short-term memory (context window). It's hard to build relationships (see: 50 First Dates) or do work (see: Memento) with this condition. — Andrej Karpathy, X, 4 Jun 2025
+1 for "context engineering" over "prompt engineering". In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. — Andrej Karpathy, X, 25 Jun 2025
These models don't really have a distillation phase of taking what happened, analyzing it obsessively, thinking through it, doing some synthetic data generation process and distilling it back into the weights… I'd love to have them have less memory so that they have to look things up, and they only maintain the algorithms for thought. — Karpathy on Dwarkesh Patel podcast, 17 Oct 2025
The LLM Wiki (his current, published answer)
Published as a gist on Fri 3 Apr 2026 [5]. His actual working memory architecture for personal AI use. Three layers:
[Diagram: three layers. Layer 1, raw sources (conversations, notes, emails, clips), is ingested into Layer 2, the Wiki (markdown pages per topic; the LLM owns this). Layer 3, SCHEMA.md, holds the rules for maintaining the wiki and guides it. Queries read the wiki; a lint pass scans it for contradictions, stale claims, and orphans.]
Operations:
- ingest — one source at a time, LLM updates relevant wiki pages
- query — synthesise an answer from the wiki, file results back
- lint — scan for contradictions, stale claims, orphans
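Much of the lint operation can be mechanical before any LLM gets involved. A sketch under stated assumptions: the relative-date word list and `[[wiki-link]]` syntax are illustrative choices, not Karpathy's implementation.

```javascript
// Sketch of a mechanical wiki lint pass. Flags relative-date language
// (which goes stale) and links to pages that don't exist. The heuristics
// here are assumptions for illustration only.
function lintWiki(pages) {
  const issues = [];
  const relativeDate = /\b(yesterday|today|tomorrow|last week|next month|recently)\b/i;
  const names = new Set(Object.keys(pages));
  for (const [page, text] of Object.entries(pages)) {
    if (relativeDate.test(text)) issues.push({ page, kind: "stale-relative-date" });
    for (const [, target] of text.matchAll(/\[\[([^\]]+)\]\]/g)) {
      if (!names.has(target)) issues.push({ page, kind: "orphan-link", target });
    }
  }
  return issues;
}
```

Contradiction detection is the part that genuinely needs an LLM; the output of a scan like this is a good shortlist to hand it.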
4. What production systems actually do
| System | Primitive | Read path | Stale facts | Verdict |
|---|---|---|---|---|
| ChatGPT memory | Text blocks injected every prompt | No retrieval — just stuff context | LLM rewrite on update | works, opaque |
| Claude Code auto-memory | MEMORY.md index + topic files | Index always loaded, files on-demand | Write-time dedup + future Auto Dream | works, inspectable |
| Claude API Memory Tool | /memories/ filesystem, CRUD | Agent chooses via tool calls | Agent-managed | primitive, flexible |
| Mem0 | Vector + graph + KV hybrid | Embedding similarity | LLM UPDATE vs ADD | 97.8% junk in practice |
| Letta / MemGPT | Core / recall / archival tiers, OR filesystem | Agent pages between tiers | Agent rewrites core | filesystem variant wins |
| Zep / Graphiti | Bi-temporal knowledge graph | Hybrid graph + semantic | Invalidate old edges, don't overwrite | cleanest staleness model |
| Karpathy's LLM Wiki | Markdown wiki, LLM-curated | LLM reads pages by filename | Lint pass | what a smart person uses |
| Claude Code Auto Dream (unshipped) | Background consolidation pass | n/a — it's a write-time op | Orient → Gather → Consolidate → Prune | flag gated, reimplementable |
5. Proposed Cosmo architecture
The convergence across Karpathy, Claude Code, ChatGPT, Letta's benchmarks, and Geoffrey Litt's Stevens is striking. Everyone keeps arriving at: markdown files, LLM librarian, scheduled consolidation, and tasks as the same primitive as memory. That's what we build.
Cosmo is interface-agnostic
This was implicit before and I want it explicit now. Cosmo is not "the Telegram bot." Cosmo is the state. Telegram, Claude Code, and any future surface (voice, web, native app) are interfaces onto that state.
[Diagram: Telegram, Claude Code (any cwd), a future web UI, and future voice all read from and write to `~/cosmo-memory/`, the filesystem source of truth.]
Concretely, the memory and tasks live in one directory. Claude Code loads from it at session start (via a SessionStart hook in ~/.claude/settings.json, or by Cosmo writing an auto-generated block into ~/.claude/CLAUDE.md). Telegram loads from it on every agent turn. Nothing special happens at the interface boundary.
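One plausible shape for the SessionStart hook, hedged: this is a sketch of the `~/.claude/settings.json` hooks schema as commonly documented, and the exact field names should be checked against the current Claude Code hooks reference before relying on it. The idea is just a command whose stdout lands in the session's context.

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "cat ~/cosmo-memory/INDEX.md ~/cosmo-memory/tasks/active/*.md 2>/dev/null"
          }
        ]
      }
    ]
  }
}
```

The CLAUDE.md auto-block alternative is simpler (Cosmo rewrites a fenced section of `~/.claude/CLAUDE.md` during the dream pass) but goes stale between passes; the hook reads live state at session start.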
Storage layout
~/cosmo-memory/
├── INDEX.md # always injected, <200 lines
├── SCHEMA.md # rules the librarian follows
├── topics/ # long-term facts
│ ├── user_profile.md # name, city, preferences, stable self
│ ├── health.md # links to .claude/skills/health
│ ├── work_humankind.md
│ ├── work_h2os.md
│ ├── projects_cosmo.md
│ ├── projects_activism.md
│ ├── relationships.md
│ └── ...
├── episodes/ # raw interaction log
│ └── 2026-04/
│ └── 2026-04-24-17-45.md
├── plans/ # in-flight specs
│ └── memory-v2.md # the plan for this very system
├── tasks/ # current work
│ ├── active/
│ │ └── 2026-04-24-deploy-h2os-dispenser-47.md
│ ├── blocked/
│ └── done/2026-04/
└── inbox/ # proactive surfaces waiting for you
└── 2026-04-25.md # tomorrow's morning brief
Read path (every turn, every interface)
[Diagram: a message from Telegram, Claude Code, or a future surface reaches the Cosmo agent. The agent always injects INDEX.md (~200 lines), user_profile.md, and a summarised tasks/active/*; a router picks 1-3 topics/*.md files. All of it goes to Claude, which produces the response.]
INDEX + profile + active-tasks summary together stay under ~4k tokens and are always warm in prompt cache. Router LLM (Sonnet 4.6) picks 1-3 topic files based on the message. No embeddings. No vector search.
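The read path above can be sketched as one small assembly function. Names are illustrative (the pre-rendered active-tasks summary file is hypothetical), and `routeTopics` stands in for the single Sonnet 4.6 router call.

```javascript
// Sketch of read-path assembly. The always-injected trio stays stable
// across turns (prompt-cache friendly); only the routed topics vary.
// File names and the summary file are illustrative assumptions.
async function buildContext(message, files, routeTopics) {
  const always = [
    files["INDEX.md"],
    files["topics/user_profile.md"],
    files["tasks/_active_summary.md"], // hypothetical pre-rendered summary
  ];
  const picked = await routeTopics(message); // router returns 1-3 topic paths
  const topics = picked.slice(0, 3).map((p) => files[p]);
  return [...always, ...topics].filter(Boolean).join("\n\n---\n\n");
}
```

Keeping the always-injected blocks first and byte-stable is what makes the "warm in prompt cache" claim hold; routed topics go at the end where cache misses are cheap.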
Write path (end of turn)
[Diagram: an end-of-turn extractor (Sonnet 4.6, structured output) either skips when nothing is notable or decides: new fact → append to episodes/YYYY-MM/; updates existing → edit the topic file directly; new task → create tasks/active/YYYY-MM-DD-*.md; task update → edit the task file's frontmatter.]
Extractor must pick a bucket. No uncategorised blobs. If it can't pick confidently, the fact goes to episodes and waits for the dream pass to consolidate it properly.
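The routing half of the write path is deterministic once the extractor has spoken. A sketch, with the decision shape assumed rather than specified by the source:

```javascript
// Sketch of the write-path routing step. `decision` is assumed to be the
// extractor's structured output; the field names are illustrative.
function routeDecision(decision, now) {
  switch (decision.kind) {
    case "skip":
      return null;
    case "new_fact": // parked in episodes; the dream pass consolidates later
      return { path: `episodes/${now.slice(0, 7)}/${now.replace(/[T:]/g, "-")}.md`, op: "append" };
    case "update_fact":
      return { path: `topics/${decision.topic}.md`, op: "edit" };
    case "new_task":
      return { path: `tasks/active/${now.slice(0, 10)}-${decision.slug}.md`, op: "create" };
    case "task_update":
      return { path: decision.taskPath, op: "edit" };
    default:
      throw new Error(`extractor returned unknown kind: ${decision.kind}`);
  }
}
```

The `default` branch throwing is deliberate: an unrecognised bucket is a schema violation to fix, not a blob to store uncategorised.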
Tasks: the Boomerang successor, explained
Tasks are markdown files. Same primitive as topics, different shape. A task file looks like:
---
title: Check H2OS dispenser 47 after firmware update
created: 2026-04-24T23:20:00+09:30
status: active
priority: medium
trigger: 2026-04-24T23:50:00+09:30 # optional: when to fire
mode: question # notify | question | review
topic: work_h2os # links into topics/
depends_on: []
---
## What
Verify dispenser 47 is online and taps are responding after
pushing firmware v2.3.1 tonight.
## Context
Rolled out to 42 and 45 first, both recovered within 5 min.
...
If trigger: is absent, it's a passive task. Sits in tasks/active/ until you work on it or the dream pass re-scores it. If trigger: is present, the checker fires the agent loop at that time, using the task file as context. That's the Boomerang concept, reborn: Cosmo doesn't just ping, it acts. But now the fire path uses Sonnet 4.6 (not hardcoded Opus), and there's no auto-cascading follow-up chain.
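The checker's job is small enough to sketch. Frontmatter parsing is simplified to a regex here (a real version would use a YAML parser); the function just returns which active tasks are due.

```javascript
// Sketch of the trigger checker. Scans task files' frontmatter for a
// trigger timestamp and returns tasks whose trigger has passed.
// Parsing is deliberately naive; use a YAML parser in practice.
function dueTasks(taskFiles, now) {
  const due = [];
  for (const [path, text] of Object.entries(taskFiles)) {
    const m = text.match(/^trigger:\s*(\S+)/m);
    if (!m) continue; // no trigger: passive task, nothing to fire
    if (new Date(m[1]) <= new Date(now)) due.push(path);
  }
  return due;
}
```

Each due path then becomes one agent-loop invocation with the task file as context, which is the whole "Cosmo acts, not just pings" behaviour.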
The dream pass (nightly 3am, with catch-up)
Trigger: 3am nightly, or on startup if the scheduled run was missed.

1. ORIENT: read INDEX + last 24h of episodes + active tasks
2. GATHER: corrections, repeated patterns, neglect growth, context matches
3. CONSOLIDATE: promote episodes into topics, resolve contradictions, rewrite relative dates to absolute, re-score tasks
4. PRUNE + INDEX: cap INDEX at 200 lines, demote low-signal material, regenerate INDEX
5. BRIEF: write tomorrow's inbox/YYYY-MM-DD.md
Four phases match the Auto Dream pattern [6], plus a fifth (brief) that writes the morning inbox. The scheduler persists a "last successful run" timestamp. If the machine is off at 3am, on next wake the process sees the timestamp is stale and runs immediately. No missed nights.
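The catch-up decision reduces to one comparison against the most recent 3am boundary. A sketch (local-time handling simplified; the persisted timestamp format is an assumption):

```javascript
// Sketch of the catch-up check. `lastRunIso` is the persisted
// "last successful run" timestamp. A run is owed whenever the last
// run predates the most recent 3am slot.
function shouldRunDream(lastRunIso, nowIso) {
  const last = new Date(lastRunIso);
  const now = new Date(nowIso);
  const boundary = new Date(now);
  boundary.setHours(3, 0, 0, 0); // today's 3am, local time
  if (now < boundary) boundary.setDate(boundary.getDate() - 1); // before 3am: use yesterday's slot
  return last < boundary;
}
```

Running this on every process start is what makes "no missed nights" hold: a machine asleep at 3am just runs the pass on wake.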
Inbox and proactive surfaces
Every morning the dream pass writes inbox/YYYY-MM-DD.md, delivered at 7am to whichever interface you're using (Telegram by default, Claude Code if you're already in a session). Contents:
- What I learned yesterday (memory updates)
- Contradictions I flagged (need your call)
- Tasks that have been neglected and grown in priority
- Tasks firing today (triggers + due dates)
- Questions waiting on your input
Mid-day surfaces use three modes (Harrison Chase's framework):
- Notify → silent, morning brief only, no ping.
- Question → push when it lands (agent is blocked without input).
- Review → batched into brief + evening digest, unless time-sensitive.
An `urgent: true` escape hatch bypasses all of this and pings immediately.
Busy detection and the judge model
When Cosmo wants to surface something mid-day, a judge (Sonnet 4.6, one-shot call) decides whether to interrupt, based on derived signals:
- Calendar: meeting now? focus block? gym?
- Recent chat activity (active conversation vs 2h+ silent)
- Time of day (early-AM protected, late-night quiet)
- Explicit toggle (`/dnd`, per-task override)
- Claude Code session open (if yes, possibly inject into the session rather than push)
The judge returns {interrupt: bool, mode: notify|question|review, reason: "..."}. One cheap LLM call per surface attempt. Haiku/Sonnet 3.x couldn't do this reliably; Sonnet 4.6 can.
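The verdict should be validated before anything acts on it. A sketch (the Sonnet call itself is out of scope; the fail-safe policy shown is an assumption): malformed judge output degrades to the quietest mode rather than pinging.

```javascript
// Sketch of guarding the judge's output. Validates the JSON the judge is
// asked to return; on anything malformed, fail safe to the review bucket.
const MODES = new Set(["notify", "question", "review"]);

function parseJudgeVerdict(raw, task) {
  if (task.urgent === true) {
    // urgent: true escape hatch skips the judge entirely
    return { interrupt: true, mode: "question", reason: "urgent override" };
  }
  let v;
  try {
    v = JSON.parse(raw);
  } catch {
    v = {};
  }
  if (typeof v.interrupt !== "boolean" || !MODES.has(v.mode)) {
    return { interrupt: false, mode: "review", reason: "judge output malformed" };
  }
  return v;
}
```

Failing quiet is the right default here: a dropped ping costs a few hours of latency, a spurious ping costs trust in the whole interrupt system.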
Storage: markdown is the source of truth, Firestore is an optional mirror
Your instinct was right. The point of going markdown isn't to eliminate databases, it's to fix where the truth lives. Markdown files on disk are the canonical state. Firestore is still useful as a mirror for:
- Fast queries ("all tasks with triggers in the next hour", "all tasks where `topic == work_h2os`")
- Multi-device read access (a future web UI on your phone)
- Cross-machine sync when you spin up the MacBook Pro as always-on (flagged as a future problem)
The dream pass writes to disk, then syncs to Firestore. Disk wins on conflict. Vector embeddings were bad because retrieval was bad. Firestore for indexing is fine.
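The disk-wins rule makes the sync trivial to state. A sketch with a plain map standing in for the Firestore client (the real client's API differs; this only shows the conflict policy):

```javascript
// Sketch of the mirror sync policy: disk is canonical, the mirror is a
// read index. A Map stands in for the Firestore client here.
function syncToMirror(diskDocs, mirror) {
  for (const [path, text] of Object.entries(diskDocs)) {
    mirror.set(path, text); // disk wins unconditionally on conflict
  }
  for (const path of [...mirror.keys()]) {
    if (!(path in diskDocs)) mirror.delete(path); // deleted on disk → deleted in mirror
  }
  return mirror;
}
```

Because sync is one-directional and idempotent, it can just run at the end of every dream pass with no merge logic at all; merging only becomes a question if a second machine starts writing, which is the parked multi-machine decision.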
What we delete and what we keep
Delete:

- Qdrant (not running, stays off)
- OpenAI embedding calls
- `[STORE_MEMORY:]` directive pattern
- Old cosmo-boomerang PM2 process and its Opus fire path

Keep:

- Skills at `.claude/skills/` (procedural memory)
- Firestore `messages` collection (raw event log)
- `/memories` command (now shows INDEX + topic list)
- Design principles from Boomerang: parent-child chains, task-context-aware ack windows
6. Build order
Nine steps. Each one is shippable and useful alone. Steps 1-3 are the core.
1. Memory v1. `~/cosmo-memory/` directory. INDEX.md, SCHEMA.md, topics/, episodes/. Sonnet 4.6 router + extractor. Wired into the Cosmo agent read and write path. Qdrant and embedding code deleted.
2. Bootstrap ingest. One-off dream-style pass that reads existing Claude Code memory at `~/.claude/projects/-Users-sahil-cosmo/memory/`, session docs, skill headers, CLAUDE.md files, relevant specs. Writes initial topic files. No more cold start.
3. Dream pass v1. Nightly 3am consolidation of memory only (tasks come later). Catch-up logic for missed nights. Morning digest to Telegram.
4. Claude Code integration. SessionStart hook or CLAUDE.md auto-block that surfaces INDEX + active tasks into every Claude Code session. Same state across interfaces.
5. Tasks directory. `tasks/active/`, `tasks/blocked/`, `tasks/done/YYYY-MM/`. Read path injects an active-tasks summary. Write path can create and update tasks.
6. Plans directory. `plans/` holds feature specs. Starts with `plans/memory-v2.md`. Meta but appropriate.
7. Inbox and morning brief. Dream pass starts writing `inbox/YYYY-MM-DD.md`. Delivered at 7am.
8. Triggers. Task entries can carry a `trigger:` field. A cheap checker fires the regular agent loop at the right time. Boomerang reborn, Sonnet-powered, no auto-cascade.
9. Judge and polish. Sonnet 4.6 interrupt classifier with busy signals. Web dashboard (Cloudflare Pages on cosmo.earthlings.workers.dev) rendering tasks, inbox, and topics from the markdown. Slash commands for memory and task operations in both Telegram and Claude Code.
Rough schedule: steps 1-3 are one solid day of focused work, steps 4-7 another day, and steps 8-9 the long tail.
7. Still-open decisions
Everything else is settled. These are the ones that actually need you:
- Multi-machine coordination. You mentioned setting up the old MacBook Pro as always-on. At that point we need to decide: is the filesystem source of truth on one machine and the other machines read over SSH/syncthing? Or both machines talk to Firestore as the shared bus? Parked until the hardware is live.
- Bi-temporal rigour. Opt-in per fact class (location, relationships, jobs, phases of training) or blanket on every fact? I lean opt-in. Cheap to tighten later.
- Router vs no-router. Sonnet 4.6 picks 1-3 topic files (my default), or inject all topic files up to a token budget and let prompt caching carry the cost? I lean router first, switch to "inject all" if tokens are cheap enough at scale.
Everything else (storage = disk, bootstrap = yes, modes = notify/question/review with defaults, boomerang = principles not code, dream 3am with catch-up) is decided.
Validation notes
The research behind this page was independently validated by a second agent. All seven key claims traced to real primary sources. Minor caveats to note:
- Mem0 97.8% junk. verified GitHub issue mem0ai/mem0#4573. Exact numbers match. Slight overstatement: Sonnet upgrade "didn't fix" rather than "made it worse" (junk dropped 97% → 89.6%).
- Letta 74% vs Mem0 68.5% on LOCOMO. verified Letta blog. Benchmark run by Letta themselves, so treat as directional.
- Sleep-time Compute. verified with correction arXiv 2504.13171. Actual numbers: ~5× reduction in test-time compute (GSM-Symbolic + AIME), 2.5× cost-per-query reduction (different metric), up to 13% accuracy on GSM-Symbolic and 18% on AIME. The "2.5× or 5×" phrasing in my earlier summary conflated two different metrics.
- ChatGPT memory = four blocks, no RAG. reverse-engineered, not OpenAI-confirmed Gupta, Khemani. Both authors explicitly disclose they derived this by asking ChatGPT about itself. Credible but not official.
- Karpathy's LLM Wiki gist. verified gist, published Fri 3 Apr 2026. Covered by VentureBeat.
- Claude Code Auto Dream. verified, but reverse-engineered dream-skill, claude-code-secrets. Feature flag `tengu_onyx_plover` confirmed. Four phases confirmed. Anthropic has not officially announced this — found in npm sourcemap leak.
- OpenClaw. verified github.com/openclaw/openclaw — real, active project. Skills stored at `~/.openclaw/workspace/skills/<skill>/SKILL.md`. "Skills-as-memory" is my framing, not theirs.