Cosmo memory

Where we actually are, what the field has figured out, and what to build next. Fri 24 Apr 2026.

Contents
  1. Current state: the code isn't broken, the pattern is
  2. Evidence that vector-RAG memory fails in practice
  3. Karpathy's position (the big one)
  4. What production systems actually do
  5. Proposed Cosmo architecture (memory + tasks unified)
  6. Build order
  7. Still-open decisions
  8. Validation notes

1. Current state: the code isn't broken, the pattern is

Audit findings:

flowchart TD
  U[User via Telegram] --> B[cosmo-bot]
  B --> A[cosmo-agent]
  A -.search on NEW session only.-> M[memory.js]
  M -.connect.-> Q[("Qdrant localhost:6333, NOT RUNNING")]
  M -.embed.-> O[OpenAI embeddings]
  A --> C[Claude Agent SDK]
  C -.emits STORE_MEMORY directive.-> A
  A -.extractFactsWithSkill imported but never called.-> X[ ]
  style Q fill:#2a1818,stroke:#ff6b6b,color:#ff9a9a
  style X fill:none,stroke:none,color:none
The real problem isn't Qdrant being down. Flipping it on would just start producing junk. The Mem0 pattern (LLM extracts facts, embeds, vector-searches) has been independently measured at ~97.8% junk rate in production [1]. We'd be turning on a broken pattern.

2. Evidence that vector-RAG memory fails in practice

97.8%
of Mem0 entries were junk after a 32-day production audit (10,134 entries). Breakdown: 52.7% system-prompt re-extractions, 11.5% cron/heartbeat noise, 5.2% hallucinated profiles, plus 808 copies of one feedback-loop hallucination.

Upgrading the extractor from Gemma 2B to Claude Sonnet barely helped: junk dropped from 97% to 89.6%. A stronger model doesn't rescue an indiscriminate extraction pattern. [1]
+5.5pp
LOCOMO benchmark: GPT-4o-mini with a plain filesystem (grep, open, search_files) scores 74.0% vs Mem0's graph variant at 68.5%. The dumb filesystem beats the specialised memory layer.

Caveat: benchmark run by Letta, who competes with Mem0. Treat as directional. [2]

Across everything we surveyed, the same five patterns showed up in systems that actually work:

  1. Filesystem > vector DB as the primitive. LLMs wield files fluently because the training data is full of filesystems. They do not wield vector stores fluently.
  2. Write-time categorisation beats read-time similarity. Forcing a bucket choice on the way in (Claude Code's user/feedback/project/reference, Skills in OpenClaw, profiles in LangGraph) is what prevents the junk pile.
  3. Sleep-time consolidation. Scheduled background pass dedupes, resolves contradictions, rewrites relative dates to absolute, prunes. Letta's sleep-time compute research shows ~5× reduction in test-time compute and up to 18% accuracy lift on AIME. [3]
  4. Bi-temporal facts (Zep/Graphiti). Every fact gets valid_from / valid_to. "I moved to Adelaide" doesn't nuke "I lived in Melbourne before" — the old fact just gets invalidated.
  5. Stop retrieving, start injecting. ChatGPT's memory feature has no vector DB. Four blocks get injected on every request. With 1M context and prompt caching, retrieval solves a context-scarcity problem you no longer have, and in practice mostly retrieves junk. [4]
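Pattern 4 is easy to make concrete. A minimal sketch of bi-temporal invalidation, assuming facts are plain records with valid_from / valid_to; the shapes and function names are illustrative, not Zep/Graphiti's actual schema:

```javascript
// Sketch of bi-temporal facts (pattern 4). A fact is never deleted;
// superseding it closes its validity window instead.
function assertFact(store, subject, predicate, value, now = new Date().toISOString()) {
  // Close any currently-open fact with the same subject + predicate.
  for (const fact of store) {
    if (fact.subject === subject && fact.predicate === predicate && fact.valid_to === null) {
      fact.valid_to = now; // invalidated, not overwritten
    }
  }
  store.push({ subject, predicate, value, valid_from: now, valid_to: null });
  return store;
}

// Current view: only facts whose validity window is still open.
function currentFacts(store) {
  return store.filter((f) => f.valid_to === null);
}
```

Superseding "I lived in Melbourne" with "I moved to Adelaide" leaves the old fact in the store as history, with its window closed.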

3. Karpathy's position (the big one)

This is the single most important finding. Karpathy has been public and consistent about memory for personal agents. Three weeks ago (Fri 3 Apr 2026) he published his own working architecture.

LLMs are a bit like a coworker with anterograde amnesia. They don't consolidate or build long-running knowledge or expertise once training is over and all they have is short-term memory (context window). It's hard to build relationships (see: 50 First Dates) or do work (see: Memento) with this condition. — Andrej Karpathy, X, 4 Jun 2025
+1 for "context engineering" over "prompt engineering". In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. — Andrej Karpathy, X, 25 Jun 2025
These models don't really have a distillation phase of taking what happened, analyzing it obsessively, thinking through it, doing some synthetic data generation process and distilling it back into the weights… I'd love to have them have less memory so that they have to look things up, and they only maintain the algorithms for thought. — Karpathy on Dwarkesh Patel podcast, 17 Oct 2025

The LLM Wiki (his current, published answer)

Published as a gist on Fri 3 Apr 2026 [5]. His actual working memory architecture for personal AI use. Three layers:

flowchart TD
  R["Layer 1: Raw sources (conversations, notes, emails, clips)"]
  S["Layer 3: SCHEMA.md (rules for maintaining the wiki)"]
  W["Layer 2: The Wiki (markdown pages per topic, LLM owns this)"]
  Q[query]
  L["lint pass (contradictions, stale, orphans)"]
  R -->|ingest| W
  S -.guides.-> W
  Q --> W
  L --> W
  style W fill:#1b2d1e,stroke:#7bd88f

Operations:

Karpathy's explicit framing: no vector databases, no RAG pipelines, just markdown + an LLM acting as a full-time librarian. He argues RAG re-derives knowledge on every query — stateless, amnesiac, wasteful. A wiki does the cognitive lift once, at ingestion, and keeps a structured interlinked artifact. VentureBeat coverage.

4. What production systems actually do

| System | Primitive | Read path | Stale facts | Verdict |
| --- | --- | --- | --- | --- |
| ChatGPT memory | Text blocks injected every prompt | No retrieval — just stuff context | LLM rewrite on update | works, opaque |
| Claude Code auto-memory | MEMORY.md index + topic files | Index always loaded, files on-demand | Write-time dedup + future Auto Dream | works, inspectable |
| Claude API Memory Tool | /memories/ filesystem, CRUD | Agent chooses via tool calls | Agent-managed | primitive, flexible |
| Mem0 | Vector + graph + KV hybrid | Embedding similarity | LLM UPDATE vs ADD | 97.8% junk in practice |
| Letta / MemGPT | Core / recall / archival tiers, or filesystem | Agent pages between tiers | Agent rewrites core | filesystem variant wins |
| Zep / Graphiti | Bi-temporal knowledge graph | Hybrid graph + semantic | Invalidate old edges, don't overwrite | cleanest staleness model |
| Karpathy's LLM Wiki | Markdown wiki, LLM-curated | LLM reads pages by filename | Lint pass | what a smart person uses |
| Claude Code Auto Dream (unshipped) | Background consolidation pass | n/a — it's a write-time op | Orient → Gather → Consolidate → Prune | flag gated, reimplementable |

5. Proposed Cosmo architecture

The convergence across Karpathy, Claude Code, ChatGPT, Letta's benchmarks, and Geoffrey Litt's Stevens is striking. Everyone keeps arriving at: markdown files, LLM librarian, scheduled consolidation, and tasks as the same primitive as memory. That's what we build.

Cosmo is interface-agnostic

This was implicit before and I want it explicit now. Cosmo is not "the Telegram bot." Cosmo is the state. Telegram, Claude Code, and any future surface (voice, web, native app) are interfaces onto that state.

flowchart TD
  T[Telegram]
  CC["Claude Code sessions, any cwd"]
  W[Future web UI]
  V[Future voice]
  FS[("~/cosmo-memory/ filesystem, source of truth")]
  T --> FS
  CC --> FS
  W --> FS
  V --> FS
  FS --> T
  FS --> CC
  FS --> W
  FS --> V
  style FS fill:#1b2d1e,stroke:#7bd88f

Concretely, the memory and tasks live in one directory. Claude Code loads from it at session start (via a SessionStart hook in ~/.claude/settings.json, or by Cosmo writing an auto-generated block into ~/.claude/CLAUDE.md). Telegram loads from it on every agent turn. Nothing special happens at the interface boundary.

Storage layout

~/cosmo-memory/
├── INDEX.md                    # always injected, <200 lines
├── SCHEMA.md                   # rules the librarian follows
├── topics/                     # long-term facts
│   ├── user_profile.md         # name, city, preferences, stable self
│   ├── health.md               # links to .claude/skills/health
│   ├── work_humankind.md
│   ├── work_h2os.md
│   ├── projects_cosmo.md
│   ├── projects_activism.md
│   ├── relationships.md
│   └── ...
├── episodes/                   # raw interaction log
│   └── 2026-04/
│       └── 2026-04-24-17-45.md
├── plans/                      # in-flight specs
│   └── memory-v2.md            # the plan for this very system
├── tasks/                      # current work
│   ├── active/
│   │   └── 2026-04-24-deploy-h2os-dispenser-47.md
│   ├── blocked/
│   └── done/2026-04/
└── inbox/                      # proactive surfaces waiting for you
    └── 2026-04-25.md           # tomorrow's morning brief

Read path (every turn, every interface)

flowchart TD
  IN["Any interface: Telegram, Claude Code, or future surface"] --> A[Cosmo agent]
  A -->|always inject| I["INDEX.md, ~200 lines"]
  A -->|always inject| P[user_profile.md]
  A -->|always inject| AT["tasks/active/*, summarised"]
  A -->|router picks 1-3| T["topics/*.md"]
  I --> C[Claude]
  P --> C
  AT --> C
  T --> C
  C --> R[Response]
  style I fill:#1b2d1e,stroke:#7bd88f
  style P fill:#1b2d1e,stroke:#7bd88f
  style AT fill:#1b2d1e,stroke:#7bd88f

INDEX + profile + active-tasks summary together stay under ~4k tokens and are always warm in prompt cache. Router LLM (Sonnet 4.6) picks 1-3 topic files based on the message. No embeddings. No vector search.

Write path (end of turn)

flowchart TD
  R[Full turn exchange] --> E["Extractor (Sonnet 4.6, structured output)"]
  E -->|nothing notable| X[skip]
  E -->|notable fact| D{Decide}
  D -->|new fact| N["Append to episodes/YYYY-MM/"]
  D -->|updates existing| U[Edit topic file directly]
  D -->|new task| NT["Create tasks/active/YYYY-MM-DD-*.md"]
  D -->|task update| UT[Edit task file frontmatter]
  style X fill:#1a1a1a,stroke:#555

Extractor must pick a bucket. No uncategorised blobs. If it can't pick confidently, the fact goes to episodes and waits for the dream pass to consolidate it properly.
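A sketch of that gate, assuming the extractor (the Sonnet 4.6 structured-output call, not shown) returns a decision object. The bucket names mirror the write-path diagram; the 0.6 confidence threshold is an invented placeholder:

```javascript
// Write-path gate: every notable fact must land in a bucket. Anything
// the extractor can't bucket confidently falls back to the episode log
// and waits for the dream pass. No uncategorised blobs.
const BUCKETS = ['new_fact', 'update_topic', 'new_task', 'task_update'];

function routeExtraction(decision) {
  if (!decision || decision.kind === 'skip') {
    return { action: 'skip' }; // nothing notable this turn
  }
  if (!BUCKETS.includes(decision.kind) || (decision.confidence ?? 0) < 0.6) {
    return { action: 'append_episode' }; // low confidence: park it raw
  }
  switch (decision.kind) {
    case 'new_fact':     return { action: 'append_episode' };
    case 'update_topic': return { action: 'edit_topic', file: decision.topic };
    case 'new_task':     return { action: 'create_task' };
    case 'task_update':  return { action: 'edit_task', file: decision.task };
  }
}
```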

Tasks: the Boomerang successor, explained

Tasks are markdown files. Same primitive as topics, different shape. A task file looks like:

---
title: Check H2OS dispenser 47 after firmware update
created: 2026-04-24T23:20:00+09:30
status: active
priority: medium
trigger: 2026-04-24T23:50:00+09:30   # optional: when to fire
mode: question                        # notify | question | review
topic: work_h2os                      # links into topics/
depends_on: []
---

## What
Verify dispenser 47 is online and taps are responding after
pushing firmware v2.3.1 tonight.

## Context
Rolled out to 42 and 45 first, both recovered within 5 min.
...

If trigger: is absent, it's a passive task. Sits in tasks/active/ until you work on it or the dream pass re-scores it. If trigger: is present, the checker fires the agent loop at that time, using the task file as context. That's the Boomerang concept, reborn: Cosmo doesn't just ping, it acts. But now the fire path uses Sonnet 4.6 (not hardcoded Opus), and there's no auto-cascading follow-up chain.
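The checker needs nothing but frontmatter parsing and a clock. A sketch, with deliberately minimal parsing (a real implementation could use a YAML library) and hypothetical function names; the field names match the task file example above:

```javascript
// Pull key: value pairs out of the leading --- block of a task file.
function parseFrontmatter(md) {
  const m = md.match(/^---\n([\s\S]*?)\n---/);
  if (!m) return {};
  const fields = {};
  for (const line of m[1].split('\n')) {
    const kv = line.match(/^(\w+):\s*([^#]*)/); // strip trailing # comments
    if (kv) fields[kv[1]] = kv[2].trim();
  }
  return fields;
}

// A task is due only if it has a trigger and the trigger time has passed.
function dueTasks(taskFiles, now = new Date()) {
  return taskFiles.filter(({ body }) => {
    const fm = parseFrontmatter(body);
    // No trigger: passive task, never fired by the checker.
    return fm.trigger && new Date(fm.trigger) <= now;
  });
}
```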

The dream pass (nightly 3am, with catch-up)

flowchart TD
  S["Scheduler: 3am nightly, or on startup if missed"]
  S --> O["ORIENT: read INDEX + last 24h episodes + active tasks"]
  O --> G["GATHER: corrections, repeated patterns, neglect growth, context matches"]
  G --> CO["CONSOLIDATE: promote episodes into topics, resolve contradictions, rewrite dates to absolute, re-score tasks"]
  CO --> PR["PRUNE + INDEX: cap INDEX at 200 lines, demote low-signal, regenerate INDEX"]
  PR --> BR["BRIEF: write tomorrow's inbox/YYYY-MM-DD.md"]
  BR --> DN[Done]
  style DN fill:#1b2d1e,stroke:#7bd88f

Four phases match the Auto Dream pattern [6], plus a fifth (brief) that writes the morning inbox. The scheduler persists a "last successful run" timestamp. If the machine is off at 3am, on next wake the process sees the timestamp is stale and runs immediately. No missed nights.

Inbox and proactive surfaces

Every morning the dream pass writes inbox/YYYY-MM-DD.md, delivered at 7am to whichever interface you're using (Telegram by default, Claude Code if you're already in a session). Contents:

Mid-day surfaces use three modes (Harrison Chase's framework):

Busy detection and the judge model

When Cosmo wants to surface something mid-day, a judge (Sonnet 4.6, one-shot call) decides whether to interrupt, based on derived signals:

The judge returns {interrupt: bool, mode: notify|question|review, reason: "..."}. One cheap LLM call per surface attempt. Haiku/Sonnet 3.x couldn't do this reliably; Sonnet 4.6 can.
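Whatever the judge returns should be validated before Cosmo acts on it. A sketch matching the shape quoted above; the fallback behaviour (when in doubt, don't interrupt) is an assumption:

```javascript
const MODES = ['notify', 'question', 'review'];

// Validate the judge's one-shot JSON reply. Malformed or unparseable
// output degrades to a silent no-interrupt rather than a bad ping.
function parseJudgeReply(raw) {
  let reply;
  try {
    reply = JSON.parse(raw);
  } catch {
    return { interrupt: false, mode: 'notify', reason: 'unparseable judge reply' };
  }
  if (typeof reply.interrupt !== 'boolean' || !MODES.includes(reply.mode)) {
    return { interrupt: false, mode: 'notify', reason: 'malformed judge reply' };
  }
  return reply;
}
```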

Storage: markdown is the source of truth, Firestore is an optional mirror

Your instinct was right. The point of going markdown isn't to eliminate databases, it's to fix where the truth lives. Markdown files on disk are the canonical state. Firestore is still useful as a mirror for:

The dream pass writes to disk, then syncs to Firestore. Disk wins on conflict. Vector embeddings were bad because embedding-based retrieval was a bad read path, not because databases are bad; Firestore as an index and mirror is fine.

What we delete and what we keep

Delete
  • Qdrant (not running, stays off)
  • OpenAI embedding calls
  • [STORE_MEMORY:] directive pattern
  • Old cosmo-boomerang PM2 process and its Opus fire path
Keep
  • Skills at .claude/skills/ (procedural memory)
  • Firestore messages collection (raw event log)
  • /memories command (now shows INDEX + topic list)
  • Design principles from Boomerang: parent-child chains, task-context-aware ack windows

6. Build order

Nine steps. Each one is shippable and useful alone. Steps 1-3 are the core.

  1. Memory v1. ~/cosmo-memory/ directory. INDEX.md, SCHEMA.md, topics/, episodes/. Sonnet 4.6 router + extractor. Wired into Cosmo agent read and write path. Qdrant and embedding code deleted.
  2. Bootstrap ingest. One-off dream-style pass that reads existing Claude Code memory at ~/.claude/projects/-Users-sahil-cosmo/memory/, session docs, skill headers, CLAUDE.md files, relevant specs. Writes initial topic files. No more cold start.
  3. Dream pass v1. Nightly 3am consolidation of memory only (tasks come later). Catch-up logic for missed nights. Morning digest to Telegram.
  4. Claude Code integration. SessionStart hook or CLAUDE.md auto-block that surfaces INDEX + active tasks into every Claude Code session. Same state across interfaces.
  5. Tasks directory. tasks/active/, tasks/blocked/, tasks/done/YYYY-MM/. Read path injects active-tasks summary. Write path can create/update.
  6. Plans directory. plans/ holds feature specs. Starts with plans/memory-v2.md. Meta but appropriate.
  7. Inbox and morning brief. Dream pass starts writing inbox/YYYY-MM-DD.md. Delivered at 7am.
  8. Triggers. Task entries can carry trigger: field. Cheap checker fires the regular agent loop at the right time. Boomerang reborn, Sonnet-powered, no auto-cascade.
  9. Judge and polish. Sonnet 4.6 interrupt classifier with busy signals. Web dashboard (Cloudflare Pages on cosmo.earthlings.workers.dev) rendering tasks/inbox/topics from the markdown. Slash commands for memory/tasks operations in both Telegram and Claude Code.

Rough schedule: steps 1-3 are one solid day of focused work, steps 4-7 another day, and steps 8-9 the long tail.

7. Still-open decisions

Everything else is settled. These are the ones that actually need you:

  1. Multi-machine coordination. You mentioned setting up the old MacBook Pro as always-on. At that point we need to decide: is the filesystem source of truth on one machine and the other machines read over SSH/syncthing? Or both machines talk to Firestore as the shared bus? Parked until the hardware is live.
  2. Bi-temporal rigour. Opt-in per fact class (location, relationships, jobs, phases of training) or blanket on every fact? I lean opt-in. Cheap to tighten later.
  3. Router vs no-router. Sonnet 4.6 picks 1-3 topic files (my default), or inject all topic files up to a token budget and let prompt caching carry the cost? I lean router first, switch to "inject all" if tokens are cheap enough at scale.

Everything else (storage = disk, bootstrap = yes, modes = notify/question/review with defaults, boomerang = principles not code, dream 3am with catch-up) is decided.


Validation notes

The research behind this page was independently validated by a second agent. All seven key claims traced to real primary sources. Minor caveats to note:

  1. Mem0 97.8% junk. verified GitHub issue mem0ai/mem0#4573. Exact numbers match. Slight overstatement: Sonnet upgrade "didn't fix" rather than "made it worse" (junk dropped 97% → 89.6%).
  2. Letta 74% vs Mem0 68.5% on LOCOMO. verified Letta blog. Benchmark run by Letta themselves, so treat as directional.
  3. Sleep-time Compute. verified with correction arXiv 2504.13171. Actual numbers: ~5× reduction in test-time compute (GSM-Symbolic + AIME), 2.5× cost-per-query reduction (different metric), up to 13% accuracy on GSM-Symbolic and 18% on AIME. The "2.5× or 5×" phrasing in my earlier summary conflated two different metrics.
  4. ChatGPT memory = four blocks, no RAG. reverse-engineered, not OpenAI-confirmed Gupta, Khemani. Both authors explicitly disclose they derived this by asking ChatGPT about itself. Credible but not official.
  5. Karpathy's LLM Wiki gist. verified gist, published Fri 3 Apr 2026. Covered by VentureBeat.
  6. Claude Code Auto Dream. verified, but reverse-engineered dream-skill, claude-code-secrets. Feature flag tengu_onyx_plover confirmed. Four phases confirmed. Anthropic has not officially announced this — found in npm sourcemap leak.
  7. OpenClaw. verified github.com/openclaw/openclaw — real, active project. Skills stored at ~/.openclaw/workspace/skills/<skill>/SKILL.md. "Skills-as-memory" is my framing, not theirs.

Sources pulled Fri 24 Apr 2026, ~16:00 ACST. Document saved at specs/research/memory-architecture-proposal.html.
