Building an AI Self-Improvement Loop from Claude Code Session History
The Problem
As a heavy Claude Code user (241 messages/day across coding, documentation, data analysis, and operational tasks), I accumulated 7 months of conversation history — 735 sessions across 30+ projects. That history contains domain knowledge, decisions, corrections, friction patterns, and workflow templates that get re-derived from scratch in every new session.
I’ve tried organizing this knowledge before: PARA method, skills backlog, retrospectives, video transcript extraction. Every system died from the same three causes: no time to maintain (client delivery always wins), too much friction to use (too many steps to file properly), and no immediate payoff (value is for future-me, but present-me is fighting fires).
The Vision
A system that consumes its own output to improve itself — an ouroboros. Session history feeds knowledge extraction, which feeds better rules/skills/hooks, which produce better sessions, which feed more improvements. The loop must be self-maintaining or it shares the fate of every prior organizational attempt.
Inspired by:
- Karpathy’s LLM Wiki pattern — raw sources → LLM compilation → queryable markdown wiki
- A YouTube implementation extending the pattern for internal session data via Claude Code hooks
- Ouroboros — a self-modifying AI agent governed by a philosophical constitution
- Phantom — an autonomous AI co-worker with three-tier vector memory (episodic/semantic/procedural), a 6-step self-evolution pipeline with safety gates, and MCP server exposure
- Claude Code Karma — open-source dashboard that parses Claude Code’s local JSONL session data with production-grade models (7200+ lines of parsers), providing the data layer for session analysis
What We Did in One Session
Phase 1: Problem Space Mapping
Explored 11 of my projects to understand the full ecosystem, including:
- A session dashboard tool with production-grade JSONL parsers for Claude Code data (7200+ lines of parsing models)
- An autonomous agent deployed on Hetzner with three-tier vector memory, 6-step self-evolution pipeline, and MCP server
- A professional portfolio site showing expertise areas and services
- An n8n orchestration PoC (Symfony + n8n + PostgreSQL) showing workflow automation patterns
- A skills project with 6 implemented skills, 28 unimplemented feature requests, 3 retrospectives never acted on — the case study of the broken feedback loop
- A major client project with 549 Claude Code sessions, 502 git commits, 60+ bug reports, 14 meeting transcripts, 17+ runbooks
- An Obsidian vault proving the Karpathy wiki pattern works (13 pages extracted from codebase analysis)
Also examined 4 external references (Karpathy gist, video implementation, Ouroboros agent, 3-year WhatsApp capture vault in capacities.io).
Phase 2: Structured Interview
Used elicitation techniques to uncover:
- Daily workflow vision: Open laptop → see what matters today → work human-in-the-loop at 100x speed → fail 200 times → iteration 201 dazzles. “I talked to Gabi and in 2h it was perfect.”
- Why organizational systems die: No time + too much friction + no immediate payoff. All three must be solved simultaneously.
- Compounding returns: Every session makes the next one more efficient. The self-improvement loop isn’t a side project — it’s core to how I work.
- Priority chain: (1) Knowledge capture → (2) Nightly improvement proposals → (3) Morning briefing
Phase 3: Tool Building
Built two tools during the session:
Session Analyzer CLI — A Python script that navigates Claude Code JSONL files using the dashboard’s existing parsers. Commands: overview, page N, message N --context, friction, artifacts, subagents. The subagent analysis shows reasoning chains, repeated search patterns, and “hindsight lessons” (what prior knowledge would have made the agent more efficient).
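The analyzer builds on existing parsers, but the underlying navigation is simple because the session files are plain JSONL. A minimal sketch of that first step, reading one record per line and tallying by type (the `type` field name is an assumption based on the local file layout and may differ across Claude Code versions):

```python
import json
from collections import Counter
from pathlib import Path

def summarize_session(jsonl_path: str) -> dict:
    """Count records by type in a Claude Code session JSONL file.

    Each line of the file is one JSON object; the "type" key
    (user / assistant / etc.) is assumed from observed local files.
    """
    counts: Counter = Counter()
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get("type", "unknown")] += 1
    return dict(counts)
```

Commands like `overview` are essentially this tally plus formatting; `page N` and `message N --context` are windowed slices over the same parsed records.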
MCP Server from OpenAPI — Used FastMCP to auto-generate 101 MCP tools from the dashboard’s REST API. ~10 lines of code. Every endpoint (session timeline, subagent details, tool usage, file activity, analytics) becomes a tool Claude can call natively during any session. Already wired into Claude Code’s MCP config.
Phase 4: 100-Session Analysis at Scale
Selected 100 substantial sessions (>50KB) across all projects. Distributed them to 10 parallel subagents, each analyzing 10 sessions using the analyzer tool. Each agent produced an insight report covering: session summaries, domain knowledge, friction patterns, subagent efficiency, improvement opportunities.
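The selection and fan-out step can be sketched in a few lines of stdlib Python. The size threshold and batch count come from the session described above; the directory layout and file extension are assumptions:

```python
from pathlib import Path

def select_substantial(root: str, min_bytes: int = 50_000, limit: int = 100):
    """Pick the largest session files over min_bytes, up to limit."""
    files = [p for p in Path(root).rglob("*.jsonl")
             if p.stat().st_size > min_bytes]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)[:limit]

def batch_sessions(session_paths, n_batches: int = 10):
    """Split sessions into n_batches contiguous groups, one per
    parallel subagent. Group sizes differ by at most one."""
    k, r = divmod(len(session_paths), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + k + (1 if i < r else 0)
        batches.append(session_paths[start:end])
        start = end
    return batches
```

Each batch then becomes one subagent prompt: "analyze these 10 sessions with the analyzer tool and produce an insight report."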
Results: ~145KB of insights, 1783 lines across 10 reports.
Cross-cutting findings:
| Finding | Frequency |
|---|---|
| Claude speculates instead of querying actual data | 6/10 groups |
| Subagents spend 50%+ of tool calls discovering file locations | 5/10 groups |
| Friction keyword detector catches only ~20-30% of real events | 3/10 groups |
| As many iterations go into communicating findings as into the investigation itself | 3/10 groups |
| Domain knowledge (entity relationships, API quirks) discovered multiple times, never persisted | 4/10 groups |
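The ~20-30% catch rate of the current friction detector is easy to see from its shape. A minimal sketch of keyword-based detection (the pattern list is illustrative, not the actual detector): it fires on explicit corrections but is blind to implicit ones, which is most of them.

```python
import re

# Illustrative correction keywords; real friction is often implicit
# ("actually, the ACs should describe behavior") and slips past all of these.
FRICTION_PATTERNS = [
    r"\bno[,.]?\s+(that's|that is)\b",
    r"\bwrong\b",
    r"\bnot what i (asked|meant|wanted)\b",
    r"\bdon't\b.*\binstead\b",
    r"\btry again\b",
]

def detect_friction(user_message: str) -> bool:
    """Flag a user message as a friction event if any keyword pattern hits."""
    text = user_message.lower()
    return any(re.search(p, text) for p in FRICTION_PATTERNS)
```

An explicit "no, that's wrong" is caught; a polite rephrase of the same correction is not, which is why expanding detection beyond keywords is on the roadmap.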
Phase 5: The Critical Insight
The most important finding came from examining what context was active during friction events. When Claude writes technical acceptance criteria instead of behavioral ones, the user-story-writer skill is loaded and explicitly says: “Acceptance criteria are detailed behavioral specifications, NOT test cases.” When Claude speculates about root causes, the bug-reporter skill is loaded and explicitly says: “Bug reports are factual documentation, NOT speculation.”
The problem is not missing rules — it’s a broken improvement loop.
Skills are v1. Users correct Claude in sessions. Those corrections contain the exact before/after data needed to improve the skills. But corrections never flow back into skill updates. So the same friction recurs across sessions.
The self-improvement system must close this gap:
Session friction detected (e.g., "Claude wrote technical ACs")
↓
Check what skill/CLAUDE.md was active at that moment
↓
Extract the correction (before: what Claude did, after: what user wanted)
↓
Propose specific skill amendment using actual correction examples
↓
Apply (auto or human-approved) → Skill v2
↓
Next session: less friction → validate improvement worked
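The data structure at the heart of this loop is small. A sketch of the correction record and the amendment-proposal step, assuming names and a proposal format that are illustrative rather than taken from a real implementation:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    """One friction event tied to the skill active when it occurred."""
    skill: str   # e.g. "user-story-writer"
    before: str  # what Claude produced
    after: str   # what the user corrected it to

def propose_amendment(c: Correction) -> str:
    """Draft a skill amendment from an actual before/after pair.

    The proposal text is reviewed (by a human or an auto-apply gate)
    before the skill file is bumped to v2.
    """
    return (
        f"## Proposed amendment to {c.skill}\n"
        f"Observed output:\n> {c.before}\n"
        f"Corrected to:\n> {c.after}\n"
        f"Suggestion: add this as a negative/positive example pair "
        f"in the skill's examples section."
    )
```

The key design choice: amendments quote real corrections verbatim instead of restating the rule, because the rule was already present when the friction occurred.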
What’s Next
- Synthesize the 10 insight reports into a master findings document with prioritized improvements
- Build the correction-to-skill pipeline — for each friction event where a skill was active, extract before/after and propose skill amendments
- Expand friction detection — current keyword matching misses ~70% of real friction events
- Test the MCP server live — 101 tools wired up, not yet tested in a real session
- Build codebase navigation maps — reduce subagent search thrashing by ~50%
- Set up the nightly improvement job — automated analysis → proposals → morning briefing
- Prototype Obsidian vault pages from the 100-session findings
Takeaways for the Community
- Your Claude Code session history is a goldmine. Every correction you make, every friction event, every domain fact discovered — it’s all in the JSONL files. Mine it.
- The Karpathy wiki pattern works for internal data, not just external articles. Session logs = raw sources. LLM compilation = knowledge extraction. Wiki = your compounding knowledge base.
- FastMCP + OpenAPI = instant MCP server. If your tool has a REST API, you can expose it as 100+ MCP tools in ~10 lines of Python. No hand-coding tool definitions.
- 10 parallel subagents can analyze 100 sessions in ~5 minutes. The bottleneck is synthesis, not analysis.
- The biggest insight isn’t about adding more rules — it’s about closing the feedback loop. Your skills and CLAUDE.md rules are probably already decent. The problem is that session corrections never flow back into improving them. Build the loop, not more rules.
- Every organizational system you’ve tried died for the same reason: it required your ongoing attention. The self-improvement loop must be self-maintaining or it shares their fate.