AgentGem · Engineering Journal

Building AgentGem with Claude Code

An empty repo to a public, npm-published product in eleven days: a registry, six deploy targets, a marketing site, a desktop app. Most of it driven through a coding agent. If you're trying to get real work out of one, this is less a launch story than a field manual for how to steer it.

Reconstructed from 214 session transcripts · 326 commits · 24 design specs · Jun 14–25, 2026

TL;DR · if you only skim

• A real product (registry, six deploy targets, a site, a desktop app) shipped to npm in 11 days, mostly driven through a coding agent.
• The method in one breath: spec before code, an adversarial reviewer on every diff, persistent cross-session memory, one git worktree per session, and verify against the real world.
• The honest part: where the agent fooled itself, including stale compiled tests passing green and parallel agents deleting each other's specs.
Want just the rules? Jump to the playbook in section 05, What transfers.

AgentGem packages a coding agent's hard-won configuration (skills, MCP servers, CLAUDE.md) into a portable, composable Gem. What follows is the eleven days that built it, told from the inside. The product packages coding agents; it was itself made by working closely with one, so the two stories don't really come apart. Read it for the parts that transfer: the loop that kept the agent on-rails, the guardrails that caught it when it drifted, and the failure modes nobody warns you about until you hit them.

00 · The shape of the work

One fact frames everything. Across eleven days the project's history holds 214 Claude Code sessions. Roughly 180 of them run a single recurring task, Review this change for security vulnerabilities, fired automatically on nearly every diff. Only about twenty are the "let's build X" conversations a human would recognize as milestones.

Roughly 180 of 214 sessions were automated adversarial security review. The visible features (a website, a registry, a desktop app) rode on top of an invisible verification loop that never let up. The headline work is the tip; the review automation is the iceberg.

The transferable bit is right there: don't bolt review on at the end. Wire an adversarial reviewer into the loop so it runs on every diff, automatically, while the context is still fresh. An agent will happily hand you code that looks right. A second agent told to attack that code is how you find out whether it is.

01 · What it became: the Gem pipeline

Strip away the eleven days of iteration and the product is one clean pipeline. It reads a local agent's config, redacts every secret at the moment of capture, crystallizes the result into a neutral on-disk Gem archive, then fans that one source out to deploy targets and a registry without ever re-reading the raw config.

Fig 1 — The product. Capture → redact → crystallize → distribute. The Gem archive is the one neutral source every target and the registry consume.

02 · The eleven days

Day 0–2 · the genesis

The very first commit, on June 14, isn't code. It's a design spec. That set the pattern the whole project kept: spec first, code second, with 24 specs written by the end. The founding intuition was a workflow. You install a coding agent, configure skills and MCP servers and hooks, tune it until it actually works, and then that hard-won setup is stuck on one machine. AgentGem answers the "and then what?" Two constraints held from the first hour and never bent: a browser can't read your ~/.claude, so a small local server bridges it; and secrets get redacted the instant they're read, by value and by key name.
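Redacting "by value and by key name" can be sketched as a single recursive pass over the captured config. This is a minimal illustration, not AgentGem's redaction core: the patterns, the `[REDACTED]` placeholder, and the function name `redact` are all assumptions made for the example.

```typescript
// Hypothetical sketch: patterns, placeholder, and names are illustrative
// assumptions, not AgentGem's actual redaction rules.

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Key names that suggest a secret, whatever the value looks like.
const SECRET_KEY = /(token|secret|password|api[_-]?key|authorization)/i;

// Value shapes that suggest a secret, whatever key they sit under.
const SECRET_VALUE = [
  /^sk-[A-Za-z0-9_-]{16,}$/, // API-key-like prefix
  /^gh[pousr]_[A-Za-z0-9]{20,}$/, // GitHub-token-like prefix
  /^Bearer\s+\S+$/i, // bearer credentials
];

const PLACEHOLDER = "[REDACTED]";

// Walk the config once; strings are replaced if either check fires.
function redact(value: Json, keyName = ""): Json {
  if (typeof value === "string") {
    const byKey = SECRET_KEY.test(keyName);
    const byValue = SECRET_VALUE.some((re) => re.test(value));
    return byKey || byValue ? PLACEHOLDER : value;
  }
  if (Array.isArray(value)) {
    // Array items inherit the key name of the array that holds them.
    return value.map((item) => redact(item, keyName));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const k of Object.keys(value)) {
      out[k] = redact(value[k], k);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

The two checks are deliberately independent: a token pasted into an innocently named field is caught by its shape, and an oddly shaped secret under `GITHUB_TOKEN` is caught by its key.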

Day 4 · the pivot to a format

"What about turning pack.json into an archive/fs with a manifest for agents? I'm aiming for flue, eve, and claude-managed agents, and codex."

This is where "pack" stopped being a blob and became a format: a human-authored manifest, a generated lock with per-file SHA-256 and a signable digest, the bodies as real files on disk. One decision, serialize at the edges, made everything after it cheap. The archive stays neutral, shaped like no single agent, and every consumer reads it the same way. Targets then multiplied. Eve became the reference pattern, Flue added the reusable compose hook, OpenAI Sandbox reused it. A habit shows up here that the project kept: read the real docs and pin versions from the real tarballs, because guessing at an API surface is how you ship code that won't compile.
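A generated lock with per-file SHA-256 and one signable digest might look roughly like the sketch below. The field names (`files`, `sha256`, `digest`) and the way the digest is derived from the sorted entries are assumptions for illustration; the actual lock schema isn't shown in this journal.

```typescript
import { createHash } from "node:crypto";

// Hypothetical lock shape; field names are illustrative assumptions.
interface LockEntry {
  path: string;
  sha256: string;
}
interface Lock {
  files: LockEntry[];
  digest: string;
}

function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// Build a lock from in-memory file bodies (path -> contents).
function buildLock(files: Record<string, string>): Lock {
  const entries = Object.keys(files)
    .sort() // sorted paths, so the digest doesn't depend on insertion order
    .map((path) => ({ path, sha256: sha256Hex(files[path]) }));
  // One digest over the canonical "hash  path" lines: the value you would sign.
  const digest = sha256Hex(entries.map((e) => `${e.sha256}  ${e.path}`).join("\n"));
  return { files: entries, digest };
}

// Verify by recomputing the lock from the files and comparing digests.
function verifyLock(files: Record<string, string>, lock: Lock): boolean {
  return buildLock(files).digest === lock.digest;
}
```

Because every consumer can recompute the digest from the files on disk, a target or the registry can check an archive without knowing anything about the agent it came from.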

Day 5–6 · research as steering, and the registry

Here Claude Code worked less as a typist than as a research analyst that could change direction. A shadcn thread about treating a GitHub repo as a distribution protocol became the Gem Registry the next day, built with subagent-driven TDD across ten tasks, reviewed per task and again across the whole branch. The same stretch shipped the testbed-first inversion. Instead of "introspect your machine," the canonical surface became a local .claude/ project you test-drive with your own Claude Code, then package. Packaging took zero new code; it reused the archive's selection model.

Day 7–8 · meet users where they are

The testbed grew past Claude to also cover Codex and Hermes. One finding was worth pinning to memory: Hermes has no MCP-server concept at all, so it's permanently skip-reported rather than deferred. That's the gap between "we haven't built it" and "it can't exist," and only reading the real ~/.hermes/config.yaml surfaced it.
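The difference between "we haven't built it" and "it can't exist" is easy to encode so a report never blurs the two. The sketch below is a hypothetical model: the `Support` type, the artifact names, and the example table are assumptions, and only the MCP-server row reflects the finding described above.

```typescript
// Hypothetical capability model; names and rows are illustrative assumptions.
type Support =
  | { kind: "supported" }
  | { kind: "deferred"; reason: string } // could exist, not built yet
  | { kind: "unsupported"; reason: string }; // the target has no such concept

type Artifact = "skills" | "mcpServers" | "hooks";

// Only the mcpServers row mirrors the text; the other rows are made up.
const exampleTarget: Record<Artifact, Support> = {
  skills: { kind: "supported" },
  mcpServers: { kind: "unsupported", reason: "target has no MCP-server concept" },
  hooks: { kind: "deferred", reason: "not implemented yet" },
};

// Split the requested artifacts into what ships, what is permanently
// skipped, and what is still on the to-do list.
function planDeploy(caps: Record<Artifact, Support>, wanted: Artifact[]) {
  const ship: Artifact[] = [];
  const skipped: string[] = [];
  const todo: string[] = [];
  for (const artifact of wanted) {
    const support = caps[artifact];
    if (support.kind === "supported") ship.push(artifact);
    else if (support.kind === "unsupported") skipped.push(`${artifact}: ${support.reason}`);
    else todo.push(`${artifact}: ${support.reason}`);
  }
  return { ship, skipped, todo };
}
```

Keeping `unsupported` and `deferred` as separate variants means a skip report can't quietly turn into a backlog item, or the reverse.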

Day 9 · going public

A website, a README, and the open-source prep. One lesson from the git side is worth saving to memory: a force-push doesn't remove unreachable commits from GitHub's object store, so if you ever need history truly gone, a force-push isn't enough. AgentGem went public and shipped to npm as @ninemind/agentgem. The website taught its own lesson. Lead with the outcome, not the build process. A five-step workflow rail read as intimidating, so it became a three-beat You've got an agent → Share it → Profit.

Putting the site online turned into its own small story. agentgem.ninemind.ai runs as a Cloudflare Worker (agentgem-web) that serves the built pages straight from Workers Static Assets, with one twist: a request carrying Accept: text/markdown gets a hand-authored markdown twin of the page instead of the HTML, so a browser sees the real site and an LLM reading the docs gets clean prose. The subdomain was brand new with no DNS record, so the Worker config sets custom_domain = true; one wrangler deploy then provisions the proxied DNS record and the edge TLS cert together, with no manual DNS step. A GitHub Action redeploys on every push to website/ or docs/, authenticated by a scoped Cloudflare API token (Workers Scripts edit, plus DNS and SSL edit on the zone). Cloudflare actually shows up twice in the project, because it's also a deploy target: a built Gem can be pushed to Cloudflare Workers via flue build --target cloudflare and the same wrangler deploy.
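The markdown-twin routing comes down to two small decisions: does the request ask for text/markdown, and which file is the twin. The helpers below are a sketch under assumptions: the function names are invented, and mapping a page to a sibling `.md` file is a guess at a naming scheme, not the site's confirmed layout.

```typescript
// Hypothetical helpers; the ".md" twin naming scheme is an assumption.

// True when the Accept header lists text/markdown with a non-zero q-value.
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept.split(",").some((part) => {
    const [type, ...params] = part.trim().split(";").map((s) => s.trim());
    if (type.toLowerCase() !== "text/markdown") return false;
    const q = params.find((p) => p.toLowerCase().startsWith("q="));
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}

// Map a page path to its markdown twin:
// "/docs/" becomes "/docs/index.md", "/docs/targets" becomes "/docs/targets.md".
function markdownTwin(pathname: string): string {
  if (pathname.endsWith("/")) return pathname + "index.md";
  return pathname.replace(/\.html$/, "") + ".md";
}

// Pick which asset path to serve for this request.
function resolveAsset(pathname: string, accept: string | null): string {
  return wantsMarkdown(accept) ? markdownTwin(pathname) : pathname;
}
```

Inside a Worker's fetch handler you would pass `request.headers.get("Accept")` and the URL's pathname to `resolveAsset`, then serve the resolved path from the static-assets binding. Browsers don't list text/markdown in their Accept header, so they keep getting the HTML.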

Day 10–11 · strategy, a desktop app, and the agent inside the agent

The final stretch widened into product strategy (a marketplace question and other direction calls) while still shipping: an Electron desktop app in its own worktree, an A2A target, and a deliberately restrained identity decision. AgentGem declares; it does not enforce. So it doesn't build a primitive it doesn't need yet. Knowing what not to build counts as a result.

The desktop app deserves a closer look, because it shows what "add a native app" should mean when you hand it to an agent. The brief was deliberately narrow: a thin host, not a UI rewrite. Electron's main process loads the already-built core, calls the same createApp(port) the CLI uses on an OS-assigned localhost port, and points a window at it. REST, MCP, and the API explorer keep working untouched; the renderer is the existing web page. The native folder picker, menu, tray, and auto-update get bolted on through a preload bridge, and nothing else moves. The whole thing lives in an isolated desktop/ folder with its own node_modules, so Electron's heavy binaries never leak into the lean library that publishes to npm. A cross-platform GitHub Action then builds macOS, Windows, and Linux installers in a matrix. Even there a small gotcha surfaced: electron-builder's default release tag (v{version}) collides with the npm core's tag, so the desktop release rides its own desktop-v* tag. The lesson for driving work like this: constrain the agent to wrap what already exists. "Reuse the server, add a window" is a scope an agent nails; "build a desktop app" invites it to reinvent three things you already shipped.

03 · How it was built: the Claude Code loop

Underneath the features is a repeatable method, and if you take one thing from this piece, take this. It isn't AgentGem-specific. Every significant change moved through the same loop, isolated in its own git worktree, with persistent memory carrying conclusions between sessions and an adversarial reviewer gating the merge. The agent does the typing; the loop is what keeps the typing honest.

Fig 2 — The method. Spec → plan → subagent TDD → adversarial review → merge, looped inside a worktree, with memory and real-world verification feeding every stage.

The loop above isn't abstract; it ran on a specific, mostly off-the-shelf toolchain. The numbers are how many of the 214 sessions each one shows up in.

Skills

• superpowers: brainstorm → plan → TDD → subagent build → review (40)
• /security-review, on every diff (~180)
• browser-harness, live UX checks (15)
• using-git-worktrees (21)

MCP & protocols

• context7, to pin real library docs (55)
• ACP: claude-agent-acp / codex-acp (7)
• AgentBack, the app's own MCP surface
• persistent cross-session memory

CLI & ops

• pnpm · vitest · tsc -b (148)
• wrangler · vercel · agentcore deploys
• git filter-repo --mailmap, history rewrite
• gh · npm publish

Two traps the agent won't flag for you

The memory files carry one scar over and over. When two sessions shared a single checkout, branches got scrambled, and one agent twice deleted another session's design spec as "scope creep." Agents working in parallel will step on each other with total confidence. The rule of one worktree per session was earned, not assumed.

The subtler trap cost more. The test runner executed compiled tests from dist/, and tsc -b is incremental, so after a big rename the stale compiled tests lingered and kept passing against the old code. A green check that means nothing. The agent won't catch this, because from where it sits the suite is passing. The fix is a habit you have to teach it (rm -rf dist && rebuild), and the real lesson is older than agents: a passing suite only means what you think it means if you know how the suite is wired.
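One way to make the stale-output trap visible, short of wiping dist/ every time, is to list compiled files whose source no longer exists. The sketch below assumes a simple layout where src/**/*.ts compiles to dist/**/*.js; that mapping and the function names are assumptions, not the project's actual build wiring.

```typescript
// Hypothetical layout: src/**/*.ts compiles to dist/**/*.js.

// Map a compiled file back to the source path it should have come from.
function sourceFor(distFile: string): string {
  return distFile.replace(/^dist\//, "src/").replace(/\.js$/, ".ts");
}

// Compiled files with no matching source: the leftovers an incremental
// build keeps around after a rename or delete.
function findStaleCompiled(sourceFiles: string[], distFiles: string[]): string[] {
  const sources = new Set(sourceFiles);
  return distFiles.filter((f) => f.endsWith(".js") && !sources.has(sourceFor(f)));
}

// Example: pack.* was renamed to gem.* but the old outputs linger.
const stale = findStaleCompiled(
  ["src/gem.ts", "src/gem.test.ts"],
  ["dist/gem.js", "dist/gem.test.js", "dist/pack.js", "dist/pack.test.js"],
);
// stale is ["dist/pack.js", "dist/pack.test.js"]
```

A check like this could run before the test step and fail loudly when the list is non-empty, which turns "teach the agent to clean dist/" into something the pipeline enforces.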

04 · The agent inside the agent

The most reflexive feature closes the loop completely. Analyze reads your session transcripts and computes a deterministic usage signal, counting only the tools you actually invoked rather than the whole available-but-unused catalog. Then it asks a local coding agent, over ACP, which artifacts to bundle. So AgentGem packages coding agents by consulting a coding agent. A guard drops anything the agent names that isn't really in your inventory.

That last sentence is the pattern worth copying when you let an agent make decisions. Give it a deterministic signal to reason over, not a blank prompt. Let it propose. Then check its proposal against an authoritative source it can't talk its way around. The agent is good at judgment and bad at staying honest about facts, so you keep the judgment and veto the facts.
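The propose-then-veto pattern fits in a few lines. In this sketch the transcript shape, the function names, and the example tool names are all hypothetical; it shows only the two deterministic halves (the usage count and the inventory guard), with the agent's proposal passed in as a plain list.

```typescript
// Hypothetical shapes and names, for illustration only.
interface ToolCall {
  tool: string;
}

// Deterministic signal: how often each tool was actually invoked.
function usageSignal(calls: ToolCall[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { tool } of calls) {
    counts.set(tool, (counts.get(tool) ?? 0) + 1);
  }
  return counts;
}

// Guard: keep only proposed names that exist in the authoritative inventory.
function vetProposal(proposed: string[], inventory: Set<string>) {
  const accepted = proposed.filter((name) => inventory.has(name));
  const rejected = proposed.filter((name) => !inventory.has(name));
  return { accepted, rejected };
}

// Example: the agent proposes three artifacts, one of which doesn't exist.
const signal = usageSignal([{ tool: "context7" }, { tool: "context7" }, { tool: "browser-harness" }]);
const verdict = vetProposal(
  ["context7", "browser-harness", "made-up-skill"],
  new Set(["context7", "browser-harness", "superpowers"]),
);
// verdict.accepted is ["context7", "browser-harness"]; verdict.rejected is ["made-up-skill"]
```

The agent never gets to argue with `vetProposal`: the inventory is read from disk, so a name it invented simply fails the lookup.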

Fig 3 — The reflexive loop. A deterministic signal and a local ACP agent agree on a pre-checked selection; the project inventory is the authority that vetoes any hallucination.

05 · What transfers

None of this is exotic. Strip away AgentGem and what's left is a way of working you can run on your own repo tomorrow. The shortlist:

• Spec before code. A one-page design doc per feature is the cheapest way to keep an agent on-rails; it argues with you on the page instead of in the diff. This project wrote 24.
• Put the reviewer in the loop, not at the end. An automated adversarial pass on every diff, while context is fresh, catches what a final review never will. It ran ~180 times here.
• Give the agent a memory. Write durable decisions and gotchas to a store that outlives the session, or you'll re-explain them next week and re-make the same mistake.
• One worktree per session. Parallel agents on a shared checkout will scramble branches and delete each other's work, confidently. Isolation is not optional.
• Make it prove things. Drive the real app, run the real deploy, pin versions from real tarballs. "Tests pass" and "it works" are claims until you've watched them.
• Pin docs; don't trust recall. A live-docs source (here, context7) stops the agent inventing an API that looked plausible and never existed.
• Let it read the outside world. A linked thread became the registry; a vendor's docs became three deploy targets. The agent is a decent research analyst if you point it outward.
• Decide what not to build. The most valuable design session of the eleven days ended in "don't build this yet." Restraint is a result you can ship.

06 · Where it landed

Eleven days, one design spec to a public release. And it was built the way it now asks you to work: spec it, test it, review it adversarially, verify it for real, write down what you learned, and hand your agent the parts an agent is good at.

Shipped

• 326 commits · 24 design specs
• Continuously-audited redaction core
• Neutral Gem archive (manifest + lock)
• GitHub-backed registry

Targets

• Eve · Flue · OpenAI Sandbox
• Codex · Claude-managed
• Bedrock AgentCore
• A2A Agent Card / server

Surfaces

• Local web UI + MCP
• Native desktop app
• Marketing site + docs
• npm: @ninemind/agentgem