Building AgentGem with Claude Code
An empty repo to a public, npm-published product in eleven days: a registry, six deploy targets, a marketing site, a desktop app. Most of it driven through a coding agent. If you're trying to get real work out of one, this is less a launch story than a field manual for how to steer it.
- A real product (registry, six deploy targets, a site, a desktop app) shipped to npm in 11 days, mostly driven through a coding agent.
- The method in one breath: spec before code, an adversarial reviewer on every diff, persistent cross-session memory, one git worktree per session, and verify against the real world.
- The honest part: where the agent fooled itself, including stale compiled tests passing green and parallel agents deleting each other's specs.
- Want just the rules? Jump to the playbook under "What transfers."
AgentGem packages a coding agent's hard-won configuration (skills, MCP servers,
CLAUDE.md) into a portable, composable Gem. What follows is the
eleven days that built it, told from the inside. The product packages coding agents; it was itself
made by working closely with one, so the two stories don't really come apart. Read it for the
parts that transfer: the loop that kept the agent on-rails, the guardrails that caught it when it
drifted, and the failure modes nobody warns you about until you hit them.
00 The shape of the work
One fact frames everything. Across eleven days the project's history holds 214 Claude
Code sessions. Roughly 180 of them run a single recurring task,
"Review this change for security vulnerabilities," fired automatically on nearly every
diff. Only about twenty are the "let's build X" conversations a human would recognize as
milestones.
The transferable bit is right there: don't bolt review on at the end. Wire an adversarial reviewer into the loop so it runs on every diff, automatically, while the context is still fresh. An agent will happily hand you code that looks right. A second agent told to attack that code is how you find out whether it is.
01 What it became: the Gem pipeline
Strip away the eleven days of iteration and the product is one clean pipeline. Read a local agent's config, redact every secret at the moment of capture, crystallize it into a neutral on-disk Gem archive, then fan that one source out to deploy targets and a registry without ever re-reading raw config.
02 The eleven days
Day 0–2 · the genesis
The very first commit, on June 14, isn't code. It's a design spec. That set the
pattern the whole project kept: spec first, code second, with 24 specs written by the end.
The founding intuition was a workflow. You install a coding agent, configure skills and MCP
servers and hooks, tune it until it actually works, and then that hard-won setup is stuck
on one machine. AgentGem answers the "and then what?" Two constraints held from the first hour and
never bent: a browser can't read your ~/.claude, so a small local server bridges it;
and secrets get redacted the instant they're read, by value and by key name.
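To make "by value and by key name" concrete, here is a minimal sketch of a redaction pass that runs once, at read time. Everything in it is an assumption for illustration: the `redact` function, the key-name pattern, the value patterns, and the placeholder string are not AgentGem's actual redaction core.

```typescript
// Hypothetical sketch: redact secrets by key name and by value shape.
// The patterns and the placeholder are assumptions for illustration only.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const REDACTED = "[REDACTED]";

// Key names that suggest a secret regardless of what the value looks like.
const SECRET_KEY = /(token|secret|password|api[_-]?key|authorization)/i;

// Value shapes that look like credentials regardless of the key they sit under.
const SECRET_VALUES: RegExp[] = [
  /^sk-[A-Za-z0-9_-]{16,}$/, // "sk-" prefixed API keys
  /^gh[pousr]_[A-Za-z0-9]{20,}$/, // GitHub-style tokens
  /^Bearer\s+\S+$/i, // bearer credentials
];

function looksSecret(value: string): boolean {
  return SECRET_VALUES.some((pattern) => pattern.test(value));
}

// Walk the config and return a redacted copy; the input is not mutated.
function redact(value: Json, keyName = ""): Json {
  if (typeof value === "string") {
    return SECRET_KEY.test(keyName) || looksSecret(value) ? REDACTED : value;
  }
  if (Array.isArray(value)) {
    // Array items inherit the key name of the array that holds them.
    return value.map((item) => redact(item, keyName));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = redact(child, key);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

The two checks cover different leaks: the key-name check catches a short or unusual secret stored under an obvious name, and the value check catches a recognizable credential pasted into a field with an innocent name, such as a command-line argument.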
Day 4 · the pivot to a format
"What about turning pack.json into an archive/fs with a manifest for agents? I'm aiming for flue, eve, and claude-managed agents, and codex."
This is where "pack" stopped being a blob and became a format: a human-authored
manifest, a generated lock with per-file SHA-256 and a signable digest, the bodies as real files on
disk. One decision, serialize at the edges, made everything after it cheap. The archive
stays neutral, shaped like no single agent, and every consumer reads it the same way. Targets then
multiplied. Eve became the reference pattern, Flue added the
reusable compose hook, OpenAI Sandbox reused it. A habit shows up
here that the project kept: read the real docs and pin versions from the real tarballs,
because guessing at an API surface is how you ship code that won't compile.
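The lock idea is easy to sketch: hash every file, then hash the sorted list of hashes into one value that could be signed. The version below is a rough illustration under stated assumptions. The `LockEntry` and `GemLock` shapes, the field names, and the digest layout are invented here and are not the real Gem schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes: field names are assumptions, not the real Gem schema.
interface LockEntry {
  path: string;
  sha256: string;
}

interface GemLock {
  files: LockEntry[];
  digest: string; // one value covering the whole archive, suitable for signing
}

function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// Build a lock from file bodies keyed by archive path. Sorting by path keeps
// the digest stable no matter what order the files were collected in.
function buildLock(bodies: Record<string, string>): GemLock {
  const files: LockEntry[] = Object.entries(bodies)
    .map(([path, body]) => ({ path, sha256: sha256Hex(body) }))
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));

  // The digest covers every path and every hash, so changing, adding, or
  // removing a file changes it.
  const digest = sha256Hex(files.map((f) => `${f.sha256}  ${f.path}\n`).join(""));
  return { files, digest };
}
```

Because the lock is generated and the manifest is human-authored, a consumer can recompute the hashes from the files on disk and compare them to the lock before trusting anything in the archive.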
Day 5–6 · research as steering, and the registry
Here Claude Code worked less as a typist than as a research analyst that could change
direction. A shadcn thread about treating a GitHub repo as a distribution protocol became
the Gem Registry the next day, built with subagent-driven TDD across ten tasks,
reviewed per task and again across the whole branch. The same stretch shipped the
testbed-first inversion. Instead of "introspect your machine," the canonical
surface became a local .claude/ project you test-drive with your own Claude Code, then
package. Packaging took zero new code; it reused the archive's selection model.
Day 7–8 · meet users where they are
The testbed grew past Claude to also cover Codex and Hermes. One finding was
worth pinning to memory: Hermes has no MCP-server concept at all, so it's permanently
skip-reported rather than deferred. That's the gap between "we haven't built it" and "it can't
exist," and only reading the real ~/.hermes/config.yaml surfaced it.
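One way to keep that distinction from eroding is to encode it in the type, so "deferred" and "unsupported" can never collapse into a generic "skipped." The sketch below is hypothetical: the `Support` type, the status names, and the `example-host` entry are invented for illustration and are not AgentGem's real types.

```typescript
// Hypothetical model: the names are illustrative, not AgentGem's real types.
type Support =
  | { status: "supported" }
  | { status: "deferred"; note: string } // could exist, not built yet
  | { status: "unsupported"; note: string }; // the host has no such concept

// Hermes and MCP servers, as described above: a permanent skip, not a TODO.
const hermesMcp: Support = {
  status: "unsupported",
  note: "no MCP-server concept in the host config",
};

// A made-up host, shown only to contrast the two cases.
const exampleHostMcp: Support = {
  status: "deferred",
  note: "adapter not written yet",
};

function report(host: string, artifact: string, support: Support): string {
  switch (support.status) {
    case "supported":
      return `${host}: ${artifact} packaged`;
    case "deferred":
      return `${host}: ${artifact} deferred (${support.note})`;
    case "unsupported":
      return `${host}: ${artifact} skipped (${support.note})`;
  }
}
```

With the union in place, nobody reading the report later mistakes a permanent gap for a backlog item, and adding a fourth status forces every `switch` over it to be revisited.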
Day 9 · going public
A website, a README, and the open-source prep. One lesson from the git side is worth saving to
memory: a force-push doesn't remove unreachable commits from GitHub's object store, so if
you ever need history truly gone, a force-push isn't enough. AgentGem went public and shipped to npm
as @ninemind/agentgem. The website taught its own lesson.
Lead with the outcome, not the build process. A five-step workflow rail read as
intimidating, so it became a three-beat "You've got an agent → Share it → Profit."
Putting the site online turned into its own small story. agentgem.ninemind.ai runs as
a Cloudflare Worker (agentgem-web) that serves the built pages straight from Workers
Static Assets, with one twist: a request carrying Accept: text/markdown gets a
hand-authored markdown twin of the page instead of the HTML, so a browser sees the real site and an
LLM reading the docs gets clean prose. The subdomain was brand new with no DNS record, so the
Worker config sets custom_domain = true; one wrangler deploy then
provisions the proxied DNS record and the edge TLS cert together, with no manual DNS step. A GitHub
Action redeploys on every push to website/ or docs/, authenticated by a
scoped Cloudflare API token (Workers Scripts edit, plus DNS and SSL edit on the zone). Cloudflare
actually shows up twice in the project, because it's also a deploy target: a built Gem can be pushed
to Cloudflare Workers via
flue build --target cloudflare and the same wrangler deploy.
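The markdown-twin trick comes down to one decision per request. Here is a minimal sketch of that decision as pure functions, assuming a file layout where each page has a sibling `.md` file. The function names and the path mapping are assumptions and are not taken from the real agentgem-web Worker.

```typescript
// Hypothetical sketch of the negotiation step; the path mapping and function
// names are assumptions, not the real agentgem-web source.

// True only when the Accept header explicitly lists text/markdown.
// Wildcards such as */* do not count, so browsers keep getting HTML.
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept
    .split(",")
    .map((part) => part.replace(/;.*$/, "").trim().toLowerCase())
    .includes("text/markdown");
}

// Map a page path to the path of its markdown twin (assumed layout:
// "/docs/" -> "/docs/index.md", "/about" -> "/about.md").
function markdownTwin(pathname: string): string {
  if (pathname.endsWith("/")) return `${pathname}index.md`;
  return pathname.replace(/\.html$/, "") + ".md";
}

// Decide which static asset to serve for a request.
function resolveAsset(pathname: string, accept: string | null): string {
  return wantsMarkdown(accept) ? markdownTwin(pathname) : pathname;
}
```

In a Worker, the fetch handler would call something like `resolveAsset` with the request's pathname and its `Accept` header, then ask the static-assets binding for that path. Responses that vary this way should also send `Vary: Accept`, so a shared cache never hands the markdown twin to a browser.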
Day 10–11 · strategy, a desktop app, and the agent inside the agent
The final stretch widened into product strategy, a marketplace question and other direction calls, while still shipping: an Electron desktop app in its own worktree, an A2A target, and a deliberately restrained identity decision. AgentGem declares; it does not enforce. So don't build a primitive it doesn't need yet. Knowing what not to build counts as a result.
The desktop app deserves a closer look, because it shows what "add a native app" should mean when
you hand it to an agent. The brief was deliberately narrow: a thin host, not a UI rewrite.
Electron's main process loads the already-built core, calls the same createApp(port)
the CLI uses on an OS-assigned localhost port, and points a window at it. REST, MCP, and the API
explorer keep working untouched; the renderer is the existing web page. The native folder picker,
menu, tray, and auto-update get bolted on through a preload bridge, and nothing else moves. The
whole thing lives in an isolated desktop/ folder with its own node_modules,
so Electron's heavy binaries never leak into the lean library that publishes to npm. A
cross-platform GitHub Action then builds macOS, Windows, and Linux installers in a matrix. Even
there a small gotcha surfaced: electron-builder's default release tag (v{version})
collides with the npm core's tag, so the desktop release rides its own desktop-v* tag.
The lesson for driving work like this: constrain the agent to wrap what already exists. "Reuse the
server, add a window" is a scope an agent nails; "build a desktop app" invites it to reinvent three
things you already shipped.
03 How it was built: the Claude Code loop
Underneath the features is a repeatable method, and if you take one thing from this piece, take this. It isn't AgentGem-specific. Every significant change moved through the same loop, isolated in its own git worktree, with persistent memory carrying conclusions between sessions and an adversarial reviewer gating the merge. The agent does the typing; the loop is what keeps the typing honest.
The loop above isn't abstract; it ran on a specific, mostly off-the-shelf toolchain. The numbers are how many of the 214 sessions each one shows up in.
Skills
- superpowers: brainstorm → plan → TDD → subagent build → review (40)
- /security-review, on every diff (~180)
- browser-harness, live UX checks (15)
- using-git-worktrees (21)
MCP & protocols
- context7, to pin real library docs (55)
- ACP: claude-agent-acp / codex-acp (7)
- AgentBack, the app's own MCP surface
- persistent cross-session memory
CLI & ops
- pnpm · vitest · tsc -b (148)
- wrangler · vercel · agentcore deploys
- git filter-repo --mailmap, history rewrite
- gh · npm publish
Two traps the agent won't flag for you
The memory files carry one scar over and over. When two sessions shared a single checkout, branches got scrambled, and one agent twice deleted another session's design spec as "scope creep." Agents working in parallel will step on each other with total confidence. The rule of one worktree per session was earned, not assumed.
The subtler trap cost more. The test runner executed compiled tests from
dist/, and tsc -b is incremental, so after a big rename the stale
compiled tests lingered and kept passing against the old code. A green check that means
nothing. The agent won't catch this, because from where it sits the suite is passing. The fix is a
habit you have to teach it (rm -rf dist && rebuild), and the real lesson is older than
agents: a passing suite only means what you think it means if you know how the suite is wired.
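If you would rather detect the problem than only wipe it away, the check is simple: list every compiled file that no longer has a source. The sketch below assumes a one-to-one layout where `src/foo.test.ts` compiles to `dist/foo.test.js`. The function names and example filenames are made up, and this is not the project's own tooling, which relied on the clean rebuild described above.

```typescript
// Hypothetical check: flag compiled files in dist/ with no source left in src/.
// Assumes a 1:1 layout (src/foo.test.ts -> dist/foo.test.js).

// Where a given source file is expected to land after compilation.
function expectedOutput(srcPath: string): string {
  return srcPath.replace(/^src\//, "dist/").replace(/\.tsx?$/, ".js");
}

// Return every .js file in dist/ that no current source file would produce.
// Source maps and other non-.js outputs are ignored in this sketch.
function findStaleOutputs(srcFiles: string[], distFiles: string[]): string[] {
  const expected = new Set(srcFiles.map(expectedOutput));
  return distFiles
    .filter((file) => file.endsWith(".js") && !expected.has(file))
    .sort();
}

// Example: a module was renamed from "pack" to "gem", but the old compiled
// files were never removed, so the old tests would still run and pass.
const stale = findStaleOutputs(
  ["src/gem.ts", "src/gem.test.ts"],
  ["dist/gem.js", "dist/gem.test.js", "dist/pack.js", "dist/pack.test.js", "dist/gem.js.map"],
);
// stale is ["dist/pack.js", "dist/pack.test.js"]
```

A non-empty result before a test run is a reason to fail fast: those are exactly the files that keep a suite green against code that no longer exists.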
04 The agent inside the agent
The most reflexive feature closes the loop completely. Analyze reads your session transcripts and computes a deterministic usage signal, counting only the tools you actually invoked rather than the whole available-but-unused catalog. Then it asks a local coding agent, over ACP, which artifacts to bundle. So AgentGem packages coding agents by consulting a coding agent. A guard drops anything the agent names that isn't really in your inventory.
That last sentence is the pattern worth copying when you let an agent make decisions. Give it a deterministic signal to reason over, not a blank prompt. Let it propose. Then check its proposal against an authoritative source it can't talk its way around. The agent is good at judgment and bad at staying honest about facts, so you keep the judgment and veto the facts.
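The propose-then-verify pattern fits in two small functions: one computes the deterministic signal, the other vetoes anything the agent names that the inventory doesn't contain. This is a sketch under assumptions: the `ToolCall` shape, the function names, and the example tool names stand in for whatever the real transcript format and inventory look like.

```typescript
// Hypothetical shapes; the transcript format and names are assumptions.
interface ToolCall {
  tool: string;
}

// Deterministic usage signal: count only tools that were actually invoked,
// so the available-but-unused catalog never enters the picture.
function usageSignal(calls: ToolCall[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { tool } of calls) {
    counts.set(tool, (counts.get(tool) ?? 0) + 1);
  }
  return counts;
}

// Guard: split the agent's proposal into names that exist in the
// authoritative inventory and names that do not.
function vetProposal(
  proposed: string[],
  inventory: ReadonlySet<string>,
): { accepted: string[]; dropped: string[] } {
  const accepted: string[] = [];
  const dropped: string[] = [];
  for (const name of proposed) {
    (inventory.has(name) ? accepted : dropped).push(name);
  }
  return { accepted, dropped };
}
```

The agent only ever sees the output of the first function and only ever feeds the second. Keeping the `dropped` list around, rather than silently discarding it, also gives you a cheap measure of how often the agent invents artifacts.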
05 What transfers
None of this is exotic. Strip away AgentGem and what's left is a way of working you can run on your own repo tomorrow. The shortlist:
- Spec before code. A one-page design doc per feature is the cheapest way to keep an agent on-rails; it argues with you on the page instead of in the diff. This project wrote 24.
- Put the reviewer in the loop, not at the end. An automated adversarial pass on every diff, while context is fresh, catches what a final review never will. It ran ~180 times here.
- Give the agent a memory. Write durable decisions and gotchas to a store that outlives the session, or you'll re-explain them next week and re-make the same mistake.
- One worktree per session. Parallel agents on a shared checkout will scramble branches and delete each other's work, confidently. Isolation is not optional.
- Make it prove things. Drive the real app, run the real deploy, pin versions from real tarballs. "Tests pass" and "it works" are claims until you've watched them.
- Pin docs; don't trust recall. A live-docs source (here, context7) stops the agent inventing an API that looked plausible and never existed.
- Let it read the outside world. A linked thread became the registry; a vendor's docs became three deploy targets. The agent is a decent research analyst if you point it outward.
- Decide what not to build. The most valuable design session of the eleven days ended in "don't build this yet." Restraint is a result you can ship.
06 Where it landed
Eleven days, one design spec to a public release. And it was built the way it now asks you to work: spec it, test it, review it adversarially, verify it for real, write down what you learned, and hand your agent the parts an agent is good at.
Shipped
- 326 commits · 24 design specs
- Continuously-audited redaction core
- Neutral Gem archive (manifest + lock)
- GitHub-backed registry
Targets
- Eve · Flue · OpenAI Sandbox
- Codex · Claude-managed
- Bedrock AgentCore
- A2A Agent Card / server
Surfaces
- Local web UI + MCP
- Native desktop app
- Marketing site + docs
- npm: @ninemind/agentgem