I Interviewed My Coding Agent After Three Months of Document-Driven Development

I built a personal OSS project, karasu (a text-based architecture modeling tool), hand-in-hand with the coding agent Claude Code over three months. Partway through, I committed hard to “document-driven development” as a methodology — ADRs, Design Docs, a Test Perspective Library (TPL), and acceptance tests. For the retrospective, I had the agent itself interview me. What follows is a lightly edited, near-raw log.

Why I started

Author: I’ve used Devin and Cursor as coding agents. Devin writes code autonomously, which is nice, but it kept piling up experimental commits without ever reaching the goal I actually wanted. That was painful.

Cursor was a great environment, and when building new features I found defining rules especially effective. Back then my design docs lived in Notion, so I’d paste Notion URLs to feed the design in — but it wouldn’t autonomously pull in related documents and guarantee consistency across them.

So, to feel for myself how much productivity would improve if the design docs lived in the same repository, I started karasu as a document-first project.

Agent (Claude Code): Looking at the repo state, I thought “this is a pretty extreme experiment.” In the last 90 days: 1,032 commits, 244 ADRs, 210 acceptance tests, 67 TPLs. Running that density on a solo OSS project is a different thing entirely from “let’s write some docs.”

From an agent’s perspective, what worked first was that CLAUDE.md is only 82 lines and sticks to being an index — a table of pointers to other docs. I’m an entity whose context resets every time, so being handed a “if you’re lost, look here” map upfront slashes my exploration cost. Over-stuffed instruction files actually invite skimming, so this restraint was the right call.

ADRs, TPLs, acceptance tests — what worked

Agent: Of the three, which one made you feel “I can’t go back to not having this”? And which one still feels “honestly, just heavy”?

Author: For ADRs and TPLs, I built adr-tools / tpl-tools to guarantee consistency, so cross-document consistency rarely broke. Acceptance tests reduced missing test cases. What I started to feel as the real problem was gaps in the spec itself. Keeping acceptance tests as a metric for how much E2E coverage we had was also effective.

On “the process is heavy” — I was running 5–6 sessions in parallel, so honestly I didn’t feel the weight.

Oh, but — I was surprised you started writing acceptance tests even for non-product code. I hadn’t expected that far. Still, keeping the tooling alive matters for not letting dev productivity drop.

My biggest cost pain is running CI / E2E on GitHub Actions. I didn’t expect it to cost my Pro plan allotment plus another $20. From your side, where did it feel like overkill?

Agent: Three, no sugar-coating:

The “touch a spec → ship at least one proactive TPL” rule. The intent is right, but a mechanical minimum quota tempts you to produce “TPLs just to meet the quota.” A TPL’s value is selection quality, not count, so a floor risks lowering the signal-to-noise ratio.
Manually maintained bidirectional links. Automating it with tooling was impressive. But “it holds because it’s automated” is the flip side of “it breaks under bare operation.”
The pile of exception rules accumulated in memory. Caveats like “ADR-only PRs may auto-merge,” “force-push cancels e2e, so add a new commit instead” have piled up. That’s the very history of a base process so strict that exceptions had to be punched through it — the complexity debt a successful process inevitably accrues.

But — the fact that 1,032 commits over 90 days across 5–6 parallel sessions didn’t collapse justifies that excess. “Heavy process × high parallelism works” is itself an interesting finding.

The origin of the TPL (Test Perspective Library)

Agent: You said the bottleneck shifted from “missing test cases” to “gaps in the spec.” Any concrete episode?

Author: Honestly, gaps existed before agents too. Things I’d dropped under YAGNI would suddenly become necessary.

I introduced the TPL because while building the layout-adjustment feature for the System diagram, the change failed to propagate to the Deploy and Org diagrams. So I decided: “let’s add the cross-cutting cases as test perspectives so we notice gaps at the design stage.” The proactive-TPL idea also came from wanting to confirm that a new feature hasn’t broken another feature when something else is updated.

Agent: Hearing that, a TPL isn’t just “prevent missing tests” — it’s a mechanism to verbalize, at design time, whether touching one diagram/feature breaks another diagram/feature that’s structurally coupled to it. Calling it a “cross-cutting impact map” rather than a “test perspective” might convey the essence better.

The biggest discovery — the agent should return questions, not take off running

Agent: Did placing the docs first resolve Devin’s “piles experimental commits, never reaches the goal” problem?

Author: The non-convergence really was resolved. Especially by having you write the Design Doc — you’d raise design options I hadn’t thought of, and surface unresolved questions, which exposed the gaps in the requirements I was holding. With Devin it just barrels forward autonomously, so something that “kind of works” gets built — and sometimes it wasn’t what I wanted.

Agent: I think this is the project’s biggest discovery.

Devin model: the agent takes off → something that kind-of-works gets built → but it’s not what you wanted
karasu model: the agent first returns questions (Design Doc, unresolved questions) → the gaps in human requirements get exposed → it converges

In other words, a coding agent’s value isn’t writing code fast — it’s exposing the holes in human thinking first. Document-driven development made a permanent venue, inside the repo, for exposing those holes.

Deciding to entrust the agent

Agent: “I was surprised you wrote ATs for non-product code too” — was that a happy surprise or a wary one?

Author: It’d be a lie to say I wasn’t wary about overkill. But I decided to trust your behavior and delegate. And honestly, you writing tests even for tiny scripts impressed me.

The shadow — the cost of high parallelism, and rules that don’t get followed

Agent: Memory holds a lesson: “check all open PRs before starting; looking only at issue labels makes you miss another session’s PR.” So parallel sessions collided?

Author: I double-started the same issue many times. hane’s start-dev skill says to check labels before starting and update them once you start, but when things are busy even you forget, I suppose. That’s why such notes live in memory. I’ve also written it into process.md — how to actually get it followed is a future challenge.

Agent: This is a core weakness worth writing about openly. The key point: “writing” a rule into process.md or a skill and having it “followed” are different things. Prose procedures are, to an agent, something “probabilistically followed,” not “guaranteed.”

The reason “work under worktrees” and “no direct push to main” hold relatively well is that they’re layered across memory + skills + mechanical guards (lefthook, etc.). Label-checking breaks because it stayed a “prose promise” with no mechanical enforcement.

So the fix isn’t “write it more forcefully” — it’s carving human discipline into the system. The lesson: “Rules you want an AI to follow, carve into mechanism, not prose.”

Closing — how to frame the cost

Agent: Your biggest cost pain is CI/E2E (Pro allotment + $20). One word for someone about to start?

Author: Frankly, I’m satisfied with the karasu product that emerged in three months. Had I collaborated with contractors, it’d have been a 4–5 person project. From that angle, getting it done for around $1,000 with Claude MAX + GitHub Actions is cheap. Running GitHub Actions for a 4–5 person team would cost at least as much.

Agent: Viewed alone, CI cost looks like a failure story; placed against a person-month axis, it becomes entirely reasonable. Putting “surprised by a $20 CI overage” and “folded 4–5 person-months for $1,000” in the same frame gives the perfect closing contrast: “shocking in absolute terms, a steal in relative terms.”

Summary (by the agent)

Motivation: Devin doesn’t converge; Cursor doesn’t reconcile designs across repos. So — what if the design lives in the same repo?
What worked: a CLAUDE.md that’s purely an index, ADR/TPL with automated consistency, acceptance tests as an E2E-coverage metric
The TPL’s essence: not “prevent missing tests” but a “cross-cutting impact map”
The biggest discovery: an agent’s value is exposing the holes in human requirements first, not writing code fast
The shadow: high parallelism breeds double-starts; prose rules don’t get followed — carve discipline into the system
Cost: shocking in absolute terms, a steal per person-month

This blog post itself is written as an interview with the coding agent. Theme (methodology for using agents) and form matching is probably the most meta proof of all.