In April and June 2026, GitHub Copilot, Cursor, and Claude Code all tightened their pricing in the same six-week window. Copilot moved to token-based billing on June 1, sending some power users from \$29/month to \$750/month. Cursor's real cost for daily agent use landed at \$40 to \$80, not the advertised \$20. The flat-rate era for AI coding tools is over. Every unnecessary token your agent burns re-reading files it already knows is now directly hitting your bill.

I designed Zerikai Memory to address this directly. It is a local MCP server that any IDE can connect to; Cursor, VS Code, Claude Desktop, pi. It parses your codebase with tree-sitter, capturing entities and code descriptions (functions, classes, docstrings and more) into a local ChromaDB vector store, on querying, it retrieves relevant context through L2 and Lexical re-indexing with source verification (Entity, File, Line Number and L2). It runs entirely on your machine. No code ever leaves it.
This is the engineering case for Zerikai Memory over graph-based approaches: why deterministic parsing matters when your data is source code, how entity-level retrieval keeps your context window from filling up, and what building a developer-controlled memory layer taught me about the limits of AI automation.
Two Memory Domains: Why Conversations and Code Need Different Tools
There is a distinction that most memory-layer discussions miss:
Conversational memory is fluid. A user's preferences change, a support ticket gets updated, a project decision evolves over time. Memory graphs (nodes + edges, temporal tracking, multi-hop reasoning) make sense here. You need the AI to know that "User X bought a product on Tuesday and filed a ticket on Friday," and you need that to update as events happen.
Codebase memory is not fluid. Code is binary ground truth. It compiles or it doesn't. Tests pass or they don't. If you are mid-refactor and the code is broken, you don't want your AI indexing that mess, you want it indexing the stable snapshot after you have committed.
This is why I used tree-sitter instead of a memory graph. Tree-sitter is a deterministic local parser. It extracts exact syntax trees, functions, classes, methods, docstrings, with zero API calls and zero hallucination risk in the index itself. A memory graph would attempt to "summarize" or "evolve" connections between code entities, which is exactly how you get an LLM confidently telling you that "process_data()" calls "validate_input()" when the actual call chain goes through three decorators and a factory function it can't resolve.
Although today we can program by chatting with an AI agent, code is not a conversation. It is a rigid, deterministic structure. Even inside an AI-enabled app, the prompts embedded in the codebase are just strings. The code treats them no differently than a database query or an error message. The memory layer should match: deterministic parsing of strings, not probabilistic summarization of intent.
The Problem
You are paying for the same context twice, and it is bleeding your budget dry. Or you are using a continuous graph that records every interaction but lacks a deterministic view of the codebase.
VS Code Copilot, Cursor, and other AI IDEs have built-in memory layers that attempt to persist context across sessions. However, these often rely on proprietary vector databases or graph structures that are opaque to the user. You can't tap them from other IDEs or CLI tools.
Graph-based memory systems (like Graphiti) promise to model relationships and temporal dynamics, but they struggle with the static, deterministic nature of code. They can't easily track when a function was last updated or how it relates to other entities without complex multi-hop reasoning that often leads to hallucinations.
Manual context injection (copy-pasting code snippets into the prompt) is a common workaround, but it quickly eats up your token budget and becomes unmanageable as your codebase grows.
What zerikai_memory Actually Does
It's a Python MCP server that sits between your IDE and your LLM:
Your Codebase → tree-sitter (local parse) → ChromaDB (zerikai_memory/.brain/)
│
Your IDE → MCP Server (:stdio) → │
▼
DeepSeek / (Ollama)
(auto-routed synthesis)
The 4-stage query pipeline:
-
L1: Vector Search (ChromaDB). The query is embedded and matched against stored entities. Results above a configurable L2 distance threshold are dropped. Vector search is probabilistic, a query can return a near miss. But because the index is built from real parsed code, not summaries, a wrong answer is still a real function at a real file:line you can verify.
-
L2: Lexical Re-ranking. Surviving results are re-scored using
(1/distance) + (keyword_hits × weight)to boost matches on entity names and docstrings. This fixes a known failure mode of pure semantic search: a generic file summary mentioning "parsing" can outscore the actualparse_entities()function in cosine space. The re-ranker givesextract_entitiesthe edge. Nothing is dropped, it's a pure reorder. -
L3: Auto-routing (hybrid mode)*. Short, specific queries ("where is
process_datadefined?") hit Ollama locally for free. Architectural queries ("explain how the routing pipeline decides between flash and pro models") escalate to DeepSeek. The router uses a 4-step priority chain: explicit override → keyword detection → length threshold →.envfallback. -
L4: LLM Synthesis with inline citations. Every answer includes
#file.py:line (distance)citations. The agent doesn't hunt for the function; the citation gives it the exact file, line number, and L2 distance score. In VS Code Copilot, these are clickable. The developer can jump straight to the source and verify.
L3 is optional:* You can set ROUTER_MODE=cloud in the .env and all the queries go to DeepSeek**.
Why entity-level indexing matters:
Most RAG setups (including ones built into commercial IDEs) chunk files by token count or line count. A function might get split across two chunks, or buried under 500 lines of imports. Feeding raw files to an agent often injects 5,000 tokens of boilerplate to find one function. zerikai_memory retrieves the same function in roughly 200 tokens of targeted citations. The difference compounds across every query in every session.
zerikai_memory uses tree-sitter to extract each function, class, method, or component as its own atomic unit. The extract_entities function in code_indexer.py returns a list of CodeEntity objects, one per parsed symbol. Clean signal, no noise.
# From code_indexer.py: each entity is a self-contained retrieval unit @dataclass class CodeEntity: name: str entity_type: str # function, class, method, component language: str file_path: str line_number: int signature: str # def process_data(items: List[Item]) -> Result docstring: str parent_class: Optional[str] # ... metadata
When the agent queries memory, it gets back a compact payload with source citations:
Sources: extract_entities #code_indexer.py:184 (0.35) _extract_js_function #code_indexer.py:595 (0.81) _extract_js_like #code_indexer.py:543 (0.89) CodeEntity #code_indexer.py:156 (0.83)
The agent uses #file:line (distance) for reasoning regardless. What you see is filtered by your IDE's display layer, some agents surface the distance, some hide it, but the data is always in context.
Note: the
code_indexer.pycurrently parses Python, JS/TS, HTML, CSS, and Markdown. More languages are on the way, tree-sitter grammars exist for most of them; I just haven't wired them all up yet. Feel free to index zerikai_memory itself and tell it to add a new tree-sitter grammar for Go, Rust, Java, C#, etc.
Beyond the IDE: Why a Snapshot Beats a Session
A memory layer tied to your IDE session dies when you close the window. A snapshot stored on disk lives wherever you point an MCP client at it on the same machine.
That means zerikai_memory works from anywhere that speaks MCP, not just your editor:
- Automation and research. An N8N workflow can query the snapshot, pull entity data, and feed it into a pipeline. Claude Desktop can fact-check a writeup about a codebase by querying its memory directly, with inline citations pointing back to source.
- IDE switching and onboarding. Open the same project in Cursor, VS Code, or pi. The snapshot is at
zerikai_memory/.brain/. No re-indexing. No re-explaining. A new developer clones the repo, installs zerikai_memory, runsscan_workspace, and their first query pulls the architecture, key files, and conventions from the brief. Each developer generates their own snapshot locally. - Client and consulting work. Estimate the blast radius of a bug fix or the cost of a new feature by querying the project snapshot. Send a status report backed by actual entity-level context, not your memory of the last meeting.
Yes, Claude's project knowledge feature can do some of this. It also costs $20/month and locks you into one vendor. This runs locally, costs nothing, and works across Claude, Cursor, Copilot, and anything else that speaks MCP.
None of this works with a session-bound memory layer. It works because the index lives at zerikai_memory/.brain/ on disk and the same MCP server answers queries regardless of which local tool is asking. You control the snapshot. You decide when it gets updated.
Zero-Cost Indexing and the DeepSeek KV Cache Trick
tree-sitter + local ChromaDB embeddings cost $0.00 to index your codebase. No API calls, no GPU, no per-token charges. The only thing that touches a paid API is the project brief: a one-time LLM-generated summary of your codebase, structured into 9 sections:
- Overview
- Technical Stack
- Core Architecture
- Primary Conventions
- Purpose
- Key Files
- Data Flow
- Development & Testing
- Future Roadmap
That brief (~600 tokens) is locked and cached after generation. Why lock it? Because it becomes the fixed prefix of every system message sent to DeepSeek. Here is what that looks like in practice:
[SYSTEM - IDENTICAL EVERY CALL, CACHED]: You are an AI coding assistant. Below is the project context: PROJECT BRIEF: Overview: A React dashboard for monitoring CI/CD pipelines... Technical Stack: Next.js 14, Prisma, PostgreSQL, Redis... Core Architecture: App Router -> API handlers -> Prisma -> Postgres... [... 600 tokens total, never changes between scans] [USER - CHANGES EVERY CALL, NOT CACHED]: "Where is get_supported_extensions defined?"
The brief portion is identical across queries, so DeepSeek computes it once and reuses the cached result. This cache persists across separate API calls on DeepSeek's servers, not just within a single session.
DeepSeek's KV cache is automatic and server-side. When the system message prefix is identical across API calls, DeepSeek reuses the cached computation for that prefix. The first query of a session warms the cache (cache miss, paid at \$0.14/M tokens). Every subsequent query hits the cache at $0.0028/M tokens. That's a 50× reduction on the largest token block per call.
If your architecture changes significantly, you trigger update_brief (a command that regenerates the brief via the LLM) after re-scanning. That's a deliberate trade-off, cache stability over automatic freshness. The brief stays locked because the cost savings compound across every query in every session.
Local mode is also fully supported: set MEMORY_MODE=local, point at a running Ollama instance, and everything (parsing, embedding, retrieval, synthesis) runs on your machine. Zero cloud dependency. The inline citations still work. Token tracking still logs (it just logs $0.00).
The re-indexing model: developer as gatekeeper
This is where the architecture makes an explicit choice that some tooling philosophies disagree with.
Re-indexing is developer-directed. You tell your AI Agent to scan_workspace when you've reached a stable state. This command walks your project directory, re-parses every source file with tree-sitter, and updates the ChromaDB index. The gate condition is your judgment that the code is coherent and functional. There is no automated trigger.
Why not automate it? Because:
- Mid-refactor indexing is not recommended. If the system auto-detected file changes and re-indexed continuously, it would ingest broken syntax and incoherent call chains. You want the index to reflect a working snapshot, not a construction site. Allowing you to pick and choose the section to refactor, fix, or add, query the current memory, prompt your implementation, test it, and then re-index when it's working is a more stable workflow.
- The scan is idempotent. Re-running it overwrites existing records using deterministic hashing. Stale entries from deleted files are purged, and anything listed in
.memignore(zerikai's.gitignoreequivalent, placed in your project root) is skipped during scans. You can scan as often as you want, 50 times in an hour if you're iterating fast, with zero penalty. - It works with or without CI. zerikai_memory doesn't assume you have a CI pipeline, a main branch, or a team workflow. It's built for solo developers and small teams who know their own codebase.
Two-tier re-indexing:
scan_workspace: file-level changes: renamed functions, added methods, updated docstrings.scan_workspace+force_refresh_brief=True: architectural changes: new modules, restructured directories, changed design patterns.
If a scan crashes mid-walk? Re-run it. If the index gets corrupted, zerikai ships with a reset script: run drop_memory.py "Workspace Name" to wipe the ChromaDB collection for that project, then re-scan. A collection backup before scan is on the roadmap, not because it's broken without one, but because "undo" is a reasonable thing to offer before you overwrite an index you're happy with.
Known Limitations (Because If You Don't List Them, Someone Else Will)
tree-sitter is a static parser. It sees syntax, not semantics. This has concrete blind spots:
Dynamic dispatch and polymorphism. obj.method() is parsed as a call, but the actual method bound at runtime (duck typing, virtual dispatch, interface implementation) is unknown. The index records the static signature, not the resolved implementation.
Decorators in Python. @cache, @retry, @authenticate; tree-sitter sees the decorator node but cannot evaluate what it does. The extracted entity is the decorated function's AST, not its runtime behavior. If a decorator wraps the function in a caching layer, the index won't reflect that the function now has a cache.
Cross-file references. tree-sitter works per file. from module import func is parsed as an identifier but not linked to its definition. There's no import resolver, no module graph. This is a deliberate trade-off: the index stores definitions, not a call graph, keeping the memory footprint small. When you query "where is get_supported_extensions used?", you'll get the definition at #code_indexer.py:184 with high confidence. The AI agent then uses that file:line citation as an entry point; it can grep the codebase, find every import or call site, and synthesize the full picture. The memory provides the entry point; the agent does the traversal. You give up one-click cross-file citations in exchange for a leaner index and faster scans.
Monkey-patching and runtime alterations. tree-sitter indexes the original method and the replacement function as separate entities, each with its own docstring. Both are searchable. What tree-sitter cannot interpret is the assignment line that connects them (SomeClass.method = new_func) - it parses the syntax but doesn't understand that it overrides the original. But this is the same pattern as cross-file references: the memory returns both entities with file:line citations, the AI agent follows up with grep, traces the assignment, and answers which one wins. The index provides the entry points; the agent does the resolution.
Cross-cutting concerns. Logging, auth, caching implemented via decorators or middleware chains leave no parse-tree footprint. A query like "which middleware modifies the request?" will return low-relevance hits or nothing.
What I'm doing about it:
- The
entity_docstringfield can document decorator effects if authored carefully. - The project brief, stored in zerikai's local data directory at
.brain/contexts/, captures high-level descriptions of architecture and cross-cutting concerns. save_to_memoryfor edge cases. When the agent resolves something the static indexer cannot (like which method wins a monkey-patch), you can tell it tosave_to_memory(a command that stores a fact directly into the index) and persist that fact. Next query, the answer is already there. No grep needed. The system is deterministic for code, but leaves a manual escape hatch for runtime truths.- Future hybrid search (BM25 + RRF) may improve recall for indirect patterns.
- A runtime-introspection module that feeds the same ChromaDB is on the long-term roadmap: a third tier that maps not just what code exists, but what actually executes.
For now, these are accepted trade-offs. The indexer prioritizes speed, determinism, and offline operation over deep semantic understanding. If your codebase is heavily dynamic (metaclasses, runtime code generation, extensive monkey-patching), zerikai_memory will have blind spots. You should know that going in.
The Bigger Point: Developer Sovereignty
There's an assumption baked into a lot of AI tooling right now: that more automation is always better. That the ideal coding agent should proactively index, update, and adapt without the developer thinking about it.
Not every tool wants you out of the loop. Some were built to keep you in it.
zerikai_memory takes the opposite stance. It gives the AI just enough context to be useful, a clean retrieval of the right function at the right line number, and then hands control back to the developer. You decide when to index. You verify the citations. You're the gate.
I call this "developer sovereignty." It is not Luddism. It is recognizing that code is the ground truth and LLMs are a text synthesis layer on top of it. The LLM does not "understand" your codebase; it retrieves and recombines patterns from it. The more agency you give the retrieval layer, the more you risk it confidently pointing at the wrong line or synthesizing from a broken intermediate state.
There is a practical side to this philosophy. Because zerikai_memory runs entirely on your machine, no source code ever leaves it. The index, the brief, the embeddings, all of it lives in zerikai_memory/.brain/ on your disk. For healthcare, finance, defense, or any regulated industry where sending proprietary code to a cloud vendor is a non-starter, this is not a nice-to-have. It is the only way a memory layer can exist at all.
This isn't a missing feature. It's the design.
zerikai_memory is on GitHub (MIT license). If you're hitting the same context re-injection tax across sessions and IDEs, it'll stop that at the root. If your codebase is predominantly Python, JS/TS, HTML, CSS, or Markdown, it works out of the box. More languages are on the way. Tree-sitter grammars exist for most of them; I just haven't wired them all up yet. Feel free to index zerikai_memory itself and tell it to add a new tree-sitter grammar for Go, Rust, Java, C#, etc.
What am I missing? If you've built or used a codebase memory layer, I want to hear where your trade-offs landed differently.