Why I Coordinate 21 AI Agents Instead of Using One
Most people use AI as a single assistant. I split the work across 21 specialized agents, each with defined responsibilities, quality gates, and a communication relay. Here's why, and what it actually looks like in practice.
A single AI agent will confidently approve its own code with the same reasoning it used to write it. That’s the whole problem.
Specialization changes the output in concrete ways. The architect agent designs APIs by thinking about contracts, versioning, and how consumers will use the interface — it never sees implementation details. The engineer agent implements against those contracts without second-guessing the architecture. The reviewer agent reads the code fresh, with no knowledge of what the engineer intended, only what the code actually does. It catches things the author is blind to: unused imports that indicate abandoned approaches, test assertions that verify the mock instead of the behavior, error handling that silently swallows exceptions.
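The separation described above can be sketched as role-scoped context: each agent only ever loads the artifacts its role permits. This is a minimal illustration, not the author's actual configuration; the role names and artifact labels are assumptions.

```python
# Hypothetical role-scoped context map. Each agent role may load only
# the artifacts listed for it; everything else is invisible by design.
# Labels are illustrative, not taken from the author's real system.
ROLE_CONTEXT = {
    "architect": ["api_contracts", "consumer_requirements"],  # never implementation details
    "engineer":  ["api_contracts", "assigned_story"],         # implements against the contract
    "reviewer":  ["pull_request_diff"],                       # reads the code fresh, no intent
    "qa":        ["built_artifact", "test_suite"],            # runs tests independently
}

def context_for(role: str) -> list[str]:
    """Return the only artifacts an agent of this role may load."""
    return ROLE_CONTEXT[role]

# The reviewer sees the diff and nothing else, which is exactly why it
# catches what the author is blind to.
assert "assigned_story" not in context_for("reviewer")
```

The point of the sketch is that blindness is enforced structurally, not requested politely: the reviewer cannot consult the engineer's intent because that artifact never enters its context.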
The agents don’t read each other’s code directly. They communicate through a message relay system built on MCP (Model Context Protocol). The planner assigns a story. The engineer picks it up, writes the code, opens a PR, and sends a relay message saying “ready for review.” The reviewer gets that message, pulls the PR, reviews it, and either approves or sends feedback back through the relay. The QA agent runs tests independently. Each agent has its own worktree, its own context, its own CLAUDE.md with role-specific instructions. They share a codebase but never share a context window.
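The relay flow above can be approximated with a small message-queue sketch. The author's system runs over MCP; this stand-in uses plain in-process queues, and the message kinds, field names, and PR number are all hypothetical.

```python
from dataclasses import dataclass, field
from collections import defaultdict, deque

# Hypothetical relay message. The real system speaks MCP; the shape
# here (sender, recipient, kind, payload) is an assumption for
# illustration only.
@dataclass
class RelayMessage:
    sender: str
    recipient: str
    kind: str           # e.g. "story_assigned", "ready_for_review", "review_feedback"
    payload: dict = field(default_factory=dict)

class Relay:
    """Per-agent inboxes: agents share a codebase, never a context window."""

    def __init__(self) -> None:
        self.inboxes: dict[str, deque] = defaultdict(deque)

    def send(self, msg: RelayMessage) -> None:
        self.inboxes[msg.recipient].append(msg)

    def receive(self, agent: str):
        """Pop the next message for this agent, or None if the inbox is empty."""
        inbox = self.inboxes[agent]
        return inbox.popleft() if inbox else None

# Engineer finishes a story and signals the reviewer through the relay,
# not by sharing context. PR number and branch name are made up.
relay = Relay()
relay.send(RelayMessage("engineer", "reviewer", "ready_for_review",
                        {"pr": 142, "branch": "feature/relay-handoff"}))
msg = relay.receive("reviewer")
```

The design choice worth noting: the reviewer learns *that* a PR is ready and *where* it is, but reconstructs *what the code does* entirely from the diff, which preserves the fresh-eyes property the article is built on.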
Across 5 concurrent projects — StockPot (React Native inventory management), QuantBot (Python/ML quantitative trading), Dungeon Crawler (JavaScript roguelike), Meet in the Middle (Svelte collaborative meeting planner), and Maze Solver (algorithm visualization) — the system has delivered 400+ stories, generated over 10,000 tests, and maintained a test pass rate that rarely dips below 95%. The 21 agents include planners, architects, engineers, reviewers, QA specialists, and a DBA, spread across project-specific teams.
My role has shifted completely. I haven't written application code in weeks. Last Tuesday I caught myself reviewing a PR for architectural fit and realized I hadn't opened a code editor in two weeks. That shift, from writing code to designing systems that write code, turned out to be the whole point. Now I make design decisions: which features to prioritize, which architectural patterns to use, when to refactor versus push forward. I review PRs for strategic alignment rather than syntax. I tune the agent configurations when I notice quality drifting. It's the difference between being a developer and being an engineering manager, except my team works in 200,000-token sessions and never needs a standup meeting.
The system isn’t free of problems. Context drift is real — agents need persistent configuration files and strict startup protocols, or they lose their role identity as the context window fills up. Coordination overhead is real too — relay messages, handoff files, pipeline drain protocols all cost tokens and add moving parts. And tracing multi-agent coordination issues is genuinely complex. When something surfaces, you’re reading through four different agent logs to reconstruct the sequence of events and understand where the process needs refinement.
But the core tradeoff holds: the overhead of coordination is cheaper than the cost of blind spots. A single agent will ship code that looks correct to itself 100% of the time. Twenty-one agents with genuine separation of concerns will catch the issues that matter — the ones that only appear when someone else looks at your work. The AI just makes it possible to run the whole team from a single laptop — and to discover, in real-time, whether your management instincts are any good when execution speed is no longer the bottleneck.