A developer's personal LLM knowledge base — capturing, connecting, and retrieving everything learned over months of building with AI.
Every insight — terminal output, article highlight, Slack thread, or shower thought — flows through a 4-stage pipeline before it's queryable.
CLI tool, browser extension, API webhook, or mobile quick-capture
LLM extracts entities, assigns type, generates title & summary
Text → 1536-dim vector via text-embedding-3-small, stored in Qdrant
Auto-connect to related notes when cosine similarity exceeds the 0.82 threshold
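The linking stage can be sketched in a few lines of plain Python. This is only an illustration of the thresholding logic, assuming embeddings are already available as lists of floats; in the pipeline described above the nearest-neighbour search would presumably be delegated to Qdrant rather than done in a loop. The names `find_links` and `LINK_THRESHOLD` are hypothetical.

```python
import math

LINK_THRESHOLD = 0.82  # threshold quoted in the pipeline description


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def find_links(new_vec, existing, threshold=LINK_THRESHOLD):
    """Return (note_id, score) pairs scoring above the threshold, best first.

    `existing` is a hypothetical dict mapping note ids to embedding vectors.
    """
    scored = ((nid, cosine_similarity(new_vec, vec)) for nid, vec in existing.items())
    links = [(nid, score) for nid, score in scored if score > threshold]
    return sorted(links, key=lambda pair: pair[1], reverse=True)
```

A note is linked only to neighbours that clear the threshold, so an orthogonal vector (similarity 0) is never connected, while a near-duplicate is.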
Organic clusters formed by semantic proximity — not manual folders. The map shows how domains overlap and evolve.
287 notes — attention mechanisms, KV caches, quantization tricks, fine-tuning logs, prompt engineering patterns.
194 notes — distributed systems, event sourcing, CQRS, database internals, consistency models, queue architectures.
156 notes — ownership model, async runtime, SIMD, zero-copy patterns, unsafe ergonomics, compiler diagnostics.
203 notes — chunking strategies, hybrid search, re-ranking, HyDE, metadata filtering, evaluation frameworks.
138 notes — Kubernetes patterns, Nix flakes, CI/CD pipelines, observability, cost optimization, GPU scheduling.
89 notes — user research, growth loops, pricing models, developer experience, onboarding friction analysis.
Each note is auto-classified into one of five types, shaping how it's stored, linked, and surfaced during retrieval.
PagedAttention for serving.
AsyncIterator wrapper around the SSE stream. Apply a bounded channel (capacity=64) between the HTTP reader and the consumer coroutine. When the channel is full, the reader awaits; this naturally applies TCP backpressure upstream to the OpenAI API, preventing OOM on slow clients.
Natural language queries against the knowledge base, showing real results with similarity scores and source attribution.
Use an AsyncIterator wrapper around the SSE stream with a bounded channel (capacity=64). When full, the reader awaits — applying TCP backpressure upstream to the API.
Backpressure patterns in async Rust: tokio::sync::mpsc with bounded capacity, or use Semaphore to limit concurrent in-flight requests.
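The bounded-channel pattern in these results can be approximated in Python with `asyncio.Queue(maxsize=...)`, which plays the role the notes assign to a bounded `tokio::sync::mpsc` channel in Rust. The sketch below is an assumption-laden illustration, not the snippet stored in the knowledge base: `fake_sse_stream` stands in for a real SSE response, and the TCP-level effect only appears when the reader is pulling from an actual socket.

```python
import asyncio

CAPACITY = 64  # bounded channel size mentioned in the note


async def fake_sse_stream(n):
    """Hypothetical stand-in for an SSE response body."""
    for i in range(n):
        yield f"data: {i}"


async def reader(source, queue):
    """Forward chunks into the queue; put() suspends while the queue is full."""
    peak = 0
    async for chunk in source:
        await queue.put(chunk)  # this await is where backpressure is applied
        peak = max(peak, queue.qsize())
    await queue.put(None)  # sentinel: stream finished
    return peak


async def consumer(queue, received, delay=0.0):
    """Drain the queue, optionally simulating a slow client."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        await asyncio.sleep(delay)
        received.append(chunk)


async def run(n=200, capacity=CAPACITY):
    queue = asyncio.Queue(maxsize=capacity)
    received = []
    peak, _ = await asyncio.gather(
        reader(fake_sse_stream(n), queue),
        consumer(queue, received),
    )
    return peak, received
```

Calling `asyncio.run(run())` returns the peak queue depth and the chunks in arrival order. The depth never exceeds the capacity, which is what keeps memory bounded when the consumer is slow.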
256 tokens with 10% overlap is the sweet spot. Chunk size explained 3× more variance in retrieval quality than embedding model choice across 12 models tested.
LlamaIndex benchmark: recursive character splitter vs. semantic chunking. Fixed-size with overlap won on NDCG@10 for structured docs; semantic chunking edged ahead on narrative text.
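Fixed-size chunking with overlap, as in the 256-token, 10% setting quoted above, can be sketched as follows. The function operates on an already tokenized list, so the choice of tokenizer is left open; `chunk_tokens` and its parameter names are hypothetical and not taken from LlamaIndex.

```python
def chunk_tokens(tokens, chunk_size=256, overlap_ratio=0.10):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # 25 tokens for 256 at 10%
    step = chunk_size - overlap                # each window advances by 231 tokens

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

Each chunk repeats the final 25 tokens of the previous one, so a sentence that straddles a boundary is still fully present in at least one chunk.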
Leader election uses randomized timeouts (150–300ms). Safety guarantee: at most one leader per term. Decomposing consensus into election, replication, and safety makes Raft easier to implement than Paxos.
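In practice, Raft's leader lease optimization avoids read-quorum overhead. etcd's raft library offers a lease-based read mode as an option; by default etcd serves linearizable reads through ReadIndex, which confirms leadership with a heartbeat round instead of appending every GET to the log.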
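The randomized election timeout from the Raft note above can be illustrated with a small sketch. It only models the draw of timeouts and which node would become a candidate first; a real implementation also resets the timer on every heartbeat and runs a voting round. The function names are hypothetical.

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # range quoted in the note


def election_timeout_ms(rng=random):
    """Draw one follower's election timeout in milliseconds."""
    low, high = ELECTION_TIMEOUT_MS
    return rng.uniform(low, high)


def first_to_time_out(node_ids, rng=random):
    """Give every node its own timeout and return the first one to expire.

    Because the draws are independent, two nodes rarely expire at the same
    moment, which makes split votes uncommon.
    """
    timeouts = {node_id: election_timeout_ms(rng) for node_id in node_ids}
    candidate = min(timeouts, key=timeouts.get)
    return candidate, timeouts
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible for tests.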
Every Sunday, the system generates a review: new notes, emerging connections, stale clusters, and knowledge gaps to explore.
Added 8 notes from "Attention Is All You Need" re-read. Linked 3 to existing KV cache cluster.
Captured debugging session — CUDA OOM during LoRA merge. Root cause: accumulated optimizer states.
5 notes from Kubernetes sig-scheduling meeting. New cluster forming: "GPU scheduling policies."
Insight note: realized our re-ranker adds 120ms latency for only 2% NDCG gain. Flagged for removal.
3 snippet captures from pair programming — async Rust patterns for graceful shutdown.
Review session: merged 2 overlapping clusters, archived 14 stale notes, opened 3 new questions.