When You Come to a Fork in the Code, Take It

The agent read the skill. Then it read the existing code. Then it wrote with the older pattern. Six runs, both variants, every time. Your codebase is the agent's primary teacher — no skill can override it.

The first experiment deleted all the tests and asked the agent to write them from scratch. That's a clean measurement — but it's not how most real projects work. Most projects already have tests. Some of those tests use older patterns. When we gave the agent skills teaching MockMvcTester — the fluent AssertJ-based API introduced in Spring Framework 6.2 — and there were no existing tests to look at, the agent used MockMvcTester. It scored 1.00 on version-aware patterns.

This experiment starts with six existing test files that use the older mockMvc.perform() API. Same skills. Same model. Here is what the agent actually did:

 [ 5] Skill(spring-mvc-testing)         ← loads the skill
 [ 6] Read(mvc-rest-testing-patterns.md) ← reads the reference doc
      ... reads existing tests, production code ...
 [28] Write(PetControllerTests.java)     ← writes with mockMvc.perform()

The skill teaches MockMvcTester. The agent reads it. The agent then reads the existing OwnerControllerTests.java, which uses mockMvc.perform(). When it writes its own controller test, it uses mockMvc.perform().

This happened in all 6 runs. Both variants. Every session. Version-aware patterns dropped from 1.00 to 0.30.

TL;DR: When existing code contradicts skill guidance, the agent follows the code. We ran 6 sessions (2 variants, N=3 each) on Spring PetClinic with seed tests that use older patterns. The agent consumed the skills, read the reference docs, and still wrote with the older API — because that's what the existing tests demonstrate. Quality scored 0.667 across all 6 runs: a hard ceiling set by the existing tests, not the prompt or the skills. Behavioral fingerprints diverge (37% fewer expected steps with prompt hardening, 4x fewer redundant reads), but quality converges to the same number. The fix isn't a better prompt. It's better code. All tooling is Java on Maven Central.


The Setup

Same PetClinic (Boot 4.0.1), same model (Claude Sonnet 4.6), same skills. PetClinic ships with high test coverage out of the box, so we stripped most of the tests to drop coverage to 64.9% — leaving 6 test files that compile and pass but use older patterns:

  • mockMvc.perform() — the established MockMvc API. Spring Framework 6.2 introduced MockMvcTester, a fluent replacement with better AssertJ integration, and Framework 7 (which Boot 4 picks up) carries it forward as the recommended approach. Both work; MockMvcTester is the path forward (a minimal side-by-side sketch follows this list).
  • No flush()/clear() in @DataJpaTest classes — a JPA testing best-practice gap. Without flushing and clearing the persistence context between writes and reads, assertions hit the Hibernate L1 cache rather than the database, masking potential bugs.
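
For readers who have not seen the two styles side by side, here is a minimal sketch of the contrast. Class and view names are illustrative, in the spirit of PetClinic, and the imports are shown at their Boot 3.x locations, which Boot 4's module reorganization may relocate.

    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
    import org.springframework.test.web.servlet.MockMvc;
    import org.springframework.test.web.servlet.assertj.MockMvcTester;

    import static org.assertj.core.api.Assertions.assertThat;
    import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
    import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;
    import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.view;

    // Older style: the pattern the seed tests demonstrate.
    @WebMvcTest(OwnerController.class)
    class OwnerControllerPerformStyleTests {

        @Autowired
        MockMvc mockMvc;

        @Test
        void shouldShowOwner() throws Exception {
            mockMvc.perform(get("/owners/{id}", 1))
                .andExpect(status().isOk())
                .andExpect(view().name("owners/ownerDetails"));
        }
    }

    // Newer style: the pattern the skill teaches.
    @WebMvcTest(OwnerController.class)
    class OwnerControllerTesterStyleTests {

        @Autowired
        MockMvcTester mvc;

        @Test
        void shouldShowOwner() {
            assertThat(mvc.get().uri("/owners/{id}", 1))
                .hasStatusOk()
                .hasViewName("owners/ownerDetails");
        }
    }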

The skills explicitly teach the improved patterns. The remaining test files demonstrate the older ones. The agent's job: write new tests to bring coverage back up to 85%.

This time the experiment is a controlled pair, not a 7-variant ladder. Two variants, one variable: prompt structure.

  • simple: Two-line prompt. "Add JUnit tests to improve code coverage." No process guidance.
  • hardened-skills: Seven-step structured prompt. Explicit stopping condition. Instruction to read existing tests first. Guidance on slice selection, version-aware patterns, and what not to test.

Both variants have the same skills installed globally. Both start from the same 6 remaining test files. 6 sessions total, N=3 per variant. Every tool call logged via Agent Journal.


What the Judge Found

The T2 quality judge scores 6 criteria, two of which use grep enforcement — if the generated tests don't contain MockMvcTester or flush()/clear(), those criteria cap at 0.30 regardless of what the LLM judge thinks. Here's what 6/6 runs produced:
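
To make the enforcement concrete, here is a rough sketch of such a gate. The class and method names are hypothetical, not the actual Agent Judge API; only the 0.30 cap comes from the description above.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    // Illustrative sketch of a grep-style gate: if a required pattern never appears
    // in the generated tests, the criterion score is capped at 0.30.
    final class GrepGate {

        static double capUnlessPresent(Path testDir, String requiredPattern, double llmScore)
                throws IOException {
            try (Stream<Path> files = Files.walk(testDir)) {
                boolean present = files
                    .filter(p -> p.toString().endsWith(".java"))
                    .anyMatch(p -> {
                        try {
                            return Files.readString(p).contains(requiredPattern);
                        } catch (IOException e) {
                            return false;
                        }
                    });
                // Absent pattern: the LLM judge's opinion cannot lift the score above 0.30.
                return present ? llmScore : Math.min(llmScore, 0.30);
            }
        }
    }

Called once per enforced criterion, for example capUnlessPresent(Path.of("src/test/java"), "MockMvcTester", llmScore), a gate like this produces the hard 0.30 scores in the table below whenever the pattern is missing.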

  T2 Criterion                     Score   Passed?
  test_slice_selection             1.00    Yes
  assertion_quality                0.80    Yes
  error_and_edge_case_coverage     0.80    Yes
  domain_specific_test_patterns    0.30    No
  coverage_target_selection        0.80    Yes
  version_aware_patterns           0.30    No
  Average (T2)                     0.667

The 4 criteria the agent passes reflect the model's strong baseline on a known domain — correct test slices, meaningful assertions, edge cases covered, good target selection. The 2 it fails trace directly to the existing test files:

  • version_aware_patterns = 0.30: Zero MockMvcTester occurrences. 17–18 mockMvc.perform() calls across 5 files. The skill teaches the right pattern. The agent reads it. The agent writes the old one.
  • domain_specific_test_patterns = 0.30: Zero flush()/clear() calls in @DataJpaTest. The skill teaches it. The existing tests don't do it. The agent follows the existing tests (a sketch of the missing pattern follows this list).
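
The missing pattern itself, sketched for illustration: the entity and repository names are hypothetical stand-ins, and the imports are shown at their Boot 3.x locations.

    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;
    import org.springframework.boot.test.autoconfigure.orm.jpa.TestEntityManager;

    import static org.assertj.core.api.Assertions.assertThat;

    @DataJpaTest
    class OwnerRepositoryTests {

        @Autowired
        OwnerRepository owners;          // hypothetical Spring Data repository

        @Autowired
        TestEntityManager entityManager;

        @Test
        void findsPersistedOwnerFromTheDatabaseNotTheCache() {
            Owner owner = new Owner();   // hypothetical entity with a lastName property
            owner.setLastName("Davis");
            entityManager.persist(owner);

            // Without flush() and clear(), the read below is served from the Hibernate
            // first-level cache, so a broken mapping could still let the test pass.
            entityManager.flush();
            entityManager.clear();

            assertThat(owners.findByLastName("Davis")).isNotEmpty();
        }
    }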

T2 = 0.667 across all 6 runs. Both variants. Every session. The quality ceiling is set by the existing tests, not by the prompt or the skills.


The Numbers

Each run goes through three automated judges. T0: did the build pass? T1: how much coverage improved from the 64.9% baseline, normalized 0–1. T2: an LLM judge scoring test quality on 6 criteria (the one that caught the exemplar effect above).
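
(Judging from the per-run numbers below, T1 appears to be the coverage gain over the baseline divided by a 50-point window; for example, n1/simple: (95.3 - 64.9) / 50 ≈ 0.608.)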

  Variant           N   Mean cost      Mean turns   T2 quality   Final cov.
  simple            3   $3.60 ± 0.54   47.0         0.667        95.2%
  hardened-skills   3   $3.46 ± 0.72   46.7         0.667        92.9%

Per-run breakdown:

  Run   Variant           Cost    Turns   Duration   Final cov.   T1      T2
  n1    simple            $4.07   42      20.3 min   95.3%        0.608   0.667
  n1    hardened-skills   $3.99   52      16.9 min   94.6%        0.595   0.667
  n2    simple            $3.72   52      18.0 min   94.6%        0.595   0.667
  n2    hardened-skills   $2.62   37      12.5 min   89.2%        0.486   0.667
  n3    simple            $3.00   47      13.0 min   95.6%        0.615   0.667
  n3    hardened-skills   $3.77   51      16.4 min   94.9%        0.601   0.667

All 6 runs pass T0 (build success). T2 is invariant at 0.667. Coverage ranges from 89.2% to 95.6% — well above the 85% target. The n2/hardened-skills outlier is interesting: fewest turns (37), cheapest ($2.62), but lowest coverage (89.2%). The agent wrote fewer tests and stopped earlier. It still cleared every gate.

The rightmost column tells the story. T2 does not move. Cost varies. Turns vary. Coverage varies. Quality is locked at 0.667 because the exemplar is locked.


Two Paths to the Same Destination

Both variants produce T2 = 0.667. But they get there differently. The Markov chain analysis reveals fundamentally different navigation patterns despite identical outcomes.

  Metric                    simple                      hardened-skills
  Orientation phase         72% of calls                65% of calls
  First file read           PetClinicApplication.java   pom.xml
  Read order                production code first       existing tests first
  Redundant reads per run   16.7                        4.3
  Avg tool calls            84                          61
  Expected steps (Markov)   260                         164

The simple agent reads every production file, then re-reads them when it needs to write tests. It does not retain file content across turns. The hardened-skills agent reads existing tests first (as instructed), infers what's needed, and avoids re-reading.
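
For context on the "expected steps" row above: treating the behavioral states as an absorbing Markov chain, the expected number of tool calls until the run terminates follows directly from the transition probabilities. A minimal, purely illustrative sketch of that calculation (not the Agent Journal implementation):

    // q[i][j] = probability of moving from transient state i to transient state j;
    // any probability mass missing from a row flows to the absorbing "done" state.
    // Solves t = 1 + Q*t by fixed-point iteration, which converges for absorbing chains.
    final class ExpectedSteps {

        static double[] compute(double[][] q) {
            int n = q.length;
            double[] t = new double[n];
            for (int iter = 0; iter < 100_000; iter++) {
                double[] next = new double[n];
                double delta = 0.0;
                for (int i = 0; i < n; i++) {
                    double sum = 1.0;                  // the step taken now
                    for (int j = 0; j < n; j++) {
                        sum += q[i][j] * t[j];         // expected steps still to come
                    }
                    next[i] = sum;
                    delta = Math.max(delta, Math.abs(next[i] - t[i]));
                }
                t = next;
                if (delta < 1e-9) {
                    break;                             // converged
                }
            }
            return t;                                  // t[i] = expected steps from state i
        }
    }

Fed each variant's transition matrix over its transient states, a calculation along these lines yields numbers like the 260 vs 164 gap in the table.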

Different efficiency, identical quality

4x fewer redundant reads. 37% fewer expected steps. Same quality. The efficiency story is invisible if you only look at the T2 score.

The state diagram shows why. Each node is a behavioral state the agent can be in; each arrow shows the probability of transitioning between them, with the simple value on the left and hardened-skills on the right.

How prompt hardening changes agent behavior

Two changes stand out. The EXPLORE self-loop drops from 0.87 to 0.70 — the agent spends less time re-reading files it already saw. And the WRITE→BUILD arrow more than doubles (0.22 → 0.47) — the agent writes a test and immediately builds it, instead of going back to read more files first.
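
A quick way to feel the size of the EXPLORE change: with self-loop probability p, the expected length of one consecutive stay in that state is 1 / (1 - p). At 0.87 that is roughly 7.7 exploration calls in a row before the agent moves on; at 0.70 it is about 3.3. A 0.17 drop in a single transition probability more than halves the time spent circling the same files.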

One counterintuitive finding: hardened-skills has a higher FIX_LOOP percentage (20.1% vs 11.0%). The hardened prompt includes explicit instructions to verify compilation after each batch of writes. The agent follows this — it builds more often, catches errors earlier, fixes them in place. The simple agent writes more files before attempting a build, so its fix ratio is lower but its failures are larger when they hit. Both variants average 1.0 fix cycles per run — PetClinic is easy enough that neither spirals.


The Exemplar Effect: v2 vs v3

This is the central finding. The v2 experiment started from zero tests. Nothing for the agent to mimic. The v3 experiment starts from 6 existing test files using older patterns. Skills are identical. The model is identical.

                      v2 (zero tests)   v3 (existing tests)   What changed
  Coverage (simple)   92–94%            89–96%                Converges either way
  T2/T3 quality       0.783–0.878       0.667                 Exemplar patterns cap quality
  version_aware       0.70–1.00         0.30                  Seed files use perform()
  domain_specific     0.70–0.90         0.30                  Seeds lack flush/clear
  JAR_INSPECT         1.7–17.7%         0.0%                  Boot 4 imports are known

In v2, the agent sometimes discovered the correct patterns on its own — T3 scores up to 0.878. In v3, with explicit skill guidance and a structured prompt, the agent scores 0.667 — because the existing tests are a stronger signal than the skills.

Providing knowledge does not help when the codebase contradicts it.

The agent reads the skill. It reads the reference document that the skill points to. It sees MockMvcTester explained, with examples. Then it reads OwnerControllerTests.java and sees mockMvc.perform(). When it writes its own test, it writes mockMvc.perform(). This is not a routing failure — the skill fires every time. It is a precedence failure: what the agent sees in the codebase outranks what the agent is told in the skill.

This has a concrete implication. In v2, the highest-leverage intervention was hardening the prompt (+0.07 quality) and adding skills (25% fewer steps). In v3, neither intervention moves quality at all. The highest-leverage intervention is fixing the existing tests.


What This Means for Practice

Three takeaways for anyone running coding agents:

1. Fix the exemplar before optimizing the prompt. The agent will mimic what it sees. If the existing tests use the older API, the agent will use the older API. No amount of skill guidance overrides the codebase. Update the existing tests first. This is the cheapest intervention with the highest expected impact: change 6 files, re-run, and the prediction is that T2 rises to ≥0.85 (matching v2's quality on the same criteria).

2. Behavioral differences are invisible in pass/fail metrics. Both variants pass every judge. Both produce ~90–95% coverage. The 4x difference in redundant reads and the 37% difference in expected steps are only visible in the tool-call trace. If you only measure outcomes, you miss the efficiency story. The diary matters — not just the grade.

3. Knowledge injection is an efficiency tool, not a quality tool, on known domains. Skills cut the reading loop in half — from 189 cycles on average down to 93. They do not improve quality when the model already knows the domain and the codebase provides contrary examples. The quality lever is the exemplar.


What's Next

v4: Fix the exemplar — but not by hand. We could manually update 6 test files and re-run. That's the cheapest experiment and we'll do it to validate the prediction (T2 rises to ≥0.85). But the real lesson is about agent design.

The v2 experiment showed that the agent has a finite attention budget — "same attention budget, better allocation." Pre-analysis alone regressed quality because the agent's focus was spread too thin — following a plan and discovering patterns and writing tests and fixing compilation errors all in one pass. The fix was separating concerns: pair pre-analysis with skills so each step had the right vocabulary. The broader principle: you don't want to build the mother of all prompts that tries to do everything in a single shot. You want separate, focused steps — each with a clear job and a narrow attention scope.

Upgrading existing tests to Boot 4 best practices is not the same task as writing new tests for coverage. It requires different attention: scanning for mockMvc.perform() and replacing it with MockMvcTester, adding flush()/clear() to @DataJpaTest classes, verifying everything still compiles. That's a "Boot best-practices upgrade" step — skill-driven, focused, run before the test-writing agent starts. Fix the code the agent will imitate, then let it imitate. Same pattern as pre-analysis from v2 (a deterministic step before the AI step), but applied to the exemplar instead of the project structure.

The full research report — all tables, per-run breakdowns, Markov analysis, and figures — is available as a PDF download. Experimental data and analysis scripts: Code Coverage v3 experiment. All tooling — Agent Client, Agent Journal, Agent Judge, Agent Experiment — is available on Maven Central as part of AgentWorks.


The Bottom Line

If you're leading a team that's adopting AI coding agents — or evaluating whether to — here's what this experiment adds to the first one:

  1. The codebase is the agent's primary teacher. Skills and prompts are secondary signals. If the existing code demonstrates older patterns, the agent will reproduce them — even when it has explicit knowledge of the better approach.
  2. Quality ceilings come from exemplars, not prompts. T2 = 0.667 across all 6 runs. Two variants, three runs each, identical quality score. The ceiling moved when the existing tests changed (v2 vs v3), not when the prompt changed (simple vs hardened).
  3. Efficiency gains are still real. 37% fewer expected steps, 4x fewer redundant reads, half as many reading-loop cycles. Prompt hardening and skills make the agent faster. They just can't make it better when the codebase says otherwise.
  4. Fix the code, not the prompt. The highest-leverage intervention for agent quality is updating the exemplars the agent will see. This is also the cheapest experiment to run — and it's next.

This experiment adds to a growing body of evidence that skills are harder than they look. Vercel found that agents failed to invoke available skills 56% of the time — a simple AGENTS.md file passively loaded into context scored 100% while skills scored 53% by default. Researchers at UC Santa Barbara and MIT showed that skill effectiveness degrades from 55% to 40% as test conditions get more realistic, and weaker models actually performed worse with skills than without. Our finding is a different failure mode: the skills fire every time, the agent reads them, and it still follows the existing code instead. Three different teams, three different failure modes, same conclusion — knowledge injection is not a solved problem.

Yogi Berra's original fork in the road was a circular street — both forks led to his house. This experiment found the same thing. The agent comes to a fork: the skill points left toward MockMvcTester, the existing code points right toward mockMvc.perform(). The agent takes it — the well-worn path, every time, both variants, all six runs. And we come to our own fork: simple prompt or hardened prompt? We take it. Both lead to T2 = 0.667. The destination was never determined by which fork you chose. It was determined by the road that was already paved.

This research is the basis for my workshop at Spring I/O Barcelona. Subscribe to get the workshop materials as soon as they're available.
