Here Comes the Judge
Your agent says it completed the goal. How do you know? You need judges — composable, tiered, and often free. Here's how to build a jury system for your agents.
Most teams evaluate their agents the same way: run the agent, check the build, write some JUnit tests, eyeball the output.
That's a reasonable place to start. But how would you know when it stops being enough?
Java developers have spent twenty years building testing culture — JUnit, AssertJ, Mockito, TDD. The instinct to verify before shipping is already there. What's missing is the tooling to apply that instinct to agent output: composable checks, tiered evaluation, structured feedback.
The industry calls this practice evals: systematic evaluation of AI system outputs. An eval can be as simple as "did the build pass?" or as nuanced as "does this code follow idiomatic patterns?" In the Python ecosystem, evals aren't an advanced topic. They're introductory material. In Andrew Ng's Agentic AI course, "Evaluating agentic AI" appears in Module 1, right after "Task decomposition: Identifying the steps in a workflow" (the problem agent-workflow addresses) and right before "Agentic design patterns." Ng bridges the two directly: "to drive this improvement process, one of the key skills is to know how to evaluate your agentic workflow."
This isn't one course. Arize AI partnered with DeepLearning.AI to build an entire course just for agent evaluation: five labs, structured experiments, trajectory analysis. AWS teaches RAGAS + Langfuse evaluation as the final lab in its Strands agent framework course. crewAI ships self-evaluation loops as framework-provided patterns.
Java is not as far along. I introduced Spring AI's Evaluator interface and RelevancyEvaluator in April 2024 — a single-method interface, binary pass/fail, scoped to individual LLM responses. LangChain4j, despite LangChain Python's rich eval support, ships no evaluation framework at all. A later Spring AI blog post shows how to build an LLM-as-Judge self-refinement loop using Recursive Advisors — conceptually similar to what DSPy does (evaluate → feedback → retry), but a recipe rather than a shipped framework. That refinement pattern is closer to what agent-experiment does — the driver behind the code coverage experiments and the other experiments in this series. More on that in a future post.
On the scoring side, Quarkus LangChain4j ships a JUnit 5 eval extension with @EvaluationTest and Scorer. The most complete standalone effort I've found is Dokimos by Fabio Kapsahili — I discovered it while doing final research for this post. These are starting points, but they're focused on scoring individual LLM responses.
This post is about a different problem: verifying completed agent work. An agent edits files, runs builds, modifies test suites, refactors across modules. You need to check: did the build pass? Did coverage regress? Are the API migrations correct? Is the code actually good? And you need those checks to compose — cheap deterministic gates first, expensive LLM grading only when the basics pass.
If you've been following this series, you've already seen this at work — three judges in the code coverage experiments, JudgeGate as a workflow DSL primitive, a judge verdict with structured reasoning driving a routing decision in Fork in the Code.
Scores appeared. Verdicts appeared. The machinery behind them never got explained. This post introduces agent-judge — the library behind all of it.
Here Comes the Judge
The Python ecosystem has half a dozen mature eval frameworks. I researched all of them — deepeval, ragas, judges, Weave, Phoenix, Braintrust — and studied what worked. What's the best scoring model? What's the best context object? Where do deterministic checks fit? How should judges compose?
Each framework contributed something: deepeval's rich judgment metadata, ragas's approach to carrying evaluation inputs, Quotient AI's judges library's flexible score types, Braintrust and Weave's comparison and composition patterns. What I rejected was equally important — most Python frameworks treat LLM metrics as primary and deterministic checks as secondary. For agent evaluation, where the agent edits files, runs builds, and modifies code, deterministic checks are faster, cheaper, and more reliable. They get first-class support.
Academic research informed specific architectural choices. Cascaded evaluation saves 78–87% of cost by starting with cheap judges and escalating to expensive ones only when confidence is low (Jung et al. 2024) — that's CascadedJury with TierPolicy. Multi-judge ensembles outperform any single evaluation metric by 29–140% in correlation with human judgments (Zhuo et al. 2025) — that's SimpleJury with pluggable VotingStrategy. And decomposing evaluation into per-criterion binary checks produces more reliable automated judging than holistic scoring (He et al. 2025) — that's granular Check objects inside every Judgment.
One research finding shaped the design more than any framework: a score alone isn't enough. Kamoi et al. (2024) found that LLMs are good at fixing problems when told specifically what's wrong — the bottleneck is the quality of the feedback, not the model's ability to act on it. A judge that returns 0.667 tells the agent to "do better." A judge that returns structured reasoning — the build passed but coverage dropped 8 points and javax imports remain — tells the agent what to fix. This is why every Judgment carries reasoning and checks, not just a number.
The result is agent-judge — currently at v0.11.0 on Maven Central.
The Architecture at a Glance
Before diving into interfaces and code, the big picture:
Spring AI    LangChain4j    Koog    AgentClient (CLI agents)
    |             |           |              |
    v             v           v              v
─────────────────────────────────────────────────────────
  thin adapters: provided-scope, 30-100 lines each
─────────────────────────────────────────────────────────
                        |
                        v
─────────────────────────────────────────────────────────
  agent-judge-core (pure Java, zero deps)
  build checks · file comparison · LLM-as-judge
  RAG evaluation · juries · cascaded juries
─────────────────────────────────────────────────────────
Agent runtimes are vertical stacks. Evaluation cuts across all of them. A FaithfulnessJudge doesn't care whether the answer came from Spring AI, LangChain4j, Koog, or Claude Code — the same jury evaluates all of them.
The building blocks: a Judge evaluates one aspect of agent output. A Jury composes multiple judges with a voting strategy. A CascadedJury sequences juries into tiers — cheap deterministic checks first, expensive LLM evaluation only when the basics pass. The final output is a Verdict carrying aggregated and per-judge results.
The Judge Interface
The entire evaluation model starts with a single functional interface:
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
JudgmentContext is a record carrying the goal, workspace path, execution time, agent output, and arbitrary metadata. Judgment is what comes back: a score, a status (PASS, FAIL, ABSTAIN, or ERROR), reasoning, and a list of checks.
ABSTAIN is worth calling out: a judge that can't evaluate — missing data, out of scope — abstains rather than guessing. The jury doesn't count it as a pass, but doesn't reject either.
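To make the four statuses concrete, here is a minimal sketch of a judge that abstains when it has nothing to evaluate. Everything in it is a stand-in written for this illustration: `Status`, `Context`, `Result`, `SketchJudge`, and the `coverage.line` metadata key are assumptions, not agent-judge's actual types or conventions.

```java
import java.util.Locale;
import java.util.Map;

/** Stand-in types for illustration only; these are not agent-judge's classes. */
final class AbstainSketch {

    enum Status { PASS, FAIL, ABSTAIN, ERROR }

    /** Hypothetical context: a goal plus free-form metadata. */
    record Context(String goal, Map<String, String> metadata) {}

    /** Hypothetical result: a status and the reason for it. */
    record Result(Status status, String reasoning) {}

    @FunctionalInterface
    interface SketchJudge {
        Result judge(Context context);
    }

    /** Builds a judge that compares a supplied line-coverage figure to a threshold. */
    static SketchJudge minimumCoverage(double thresholdPercent) {
        return context -> {
            String raw = context.metadata().get("coverage.line"); // assumed key
            if (raw == null) {
                // Missing data: decline to vote rather than guess.
                return new Result(Status.ABSTAIN, "No coverage figure supplied; nothing to evaluate");
            }
            double coverage;
            try {
                coverage = Double.parseDouble(raw);
            } catch (NumberFormatException e) {
                // The judge itself could not run: that is an error, not a failing agent.
                return new Result(Status.ERROR, "Unreadable coverage figure: " + raw);
            }
            boolean ok = coverage >= thresholdPercent;
            return new Result(ok ? Status.PASS : Status.FAIL,
                    String.format(Locale.ROOT, "Line coverage %.1f%% against a %.1f%% threshold",
                            coverage, thresholdPercent));
        };
    }
}
```

The distinction worth keeping is that ABSTAIN and ERROR describe the judge, while PASS and FAIL describe the agent's work.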
Score is a sealed interface — not just a primitive wrapper:
public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}
The reason for the type hierarchy: jury aggregation. A MajorityVotingStrategy needs BooleanScore to count votes. A WeightedAverageStrategy needs NumericalScore to compute weighted means. NumericalScore provides normalized() for mapping arbitrary ranges (0–10, percentage, whatever) into 0–1 for uniform comparison.
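As a rough illustration of why the hierarchy helps aggregation, the sketch below shows one way a sealed score type could map different score kinds onto a common 0 to 1 scale. The record components (`value`, `min`, `max`, `category`) and the `toUnitInterval` helper are assumptions made for this example; the library's real records may be shaped differently.

```java
/** Illustrative stand-ins; field names and helpers are assumptions, not the library's API. */
final class ScoreSketch {

    sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}

    record BooleanScore(boolean value) implements Score {}

    record NumericalScore(double value, double min, double max) implements Score {
        NumericalScore {
            if (!(max > min)) {
                throw new IllegalArgumentException("max must exceed min");
            }
            if (value < min || value > max) {
                throw new IllegalArgumentException("value outside [min, max]");
            }
        }

        /** Maps the raw value linearly from [min, max] onto [0, 1]. */
        double normalized() {
            return (value - min) / (max - min);
        }
    }

    record CategoricalScore(String category) implements Score {}

    /** Converts a score to [0, 1] where that is meaningful. */
    static double toUnitInterval(Score score) {
        if (score instanceof BooleanScore b) {
            return b.value() ? 1.0 : 0.0;
        }
        if (score instanceof NumericalScore n) {
            return n.normalized();
        }
        // Categories have no inherent ordering, so refuse rather than invent one.
        throw new IllegalArgumentException("Categorical scores have no numeric value");
    }
}
```

With these definitions, a 7 on a 0 to 10 scale and a 70 on a percentage scale both normalize to 0.7, which is what lets an averaging strategy mix judges that report on different ranges.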
The reasoning — why something passed or failed — lives on Judgment itself, alongside a List<Check> for granular sub-assertions. Binary criteria produce more consistent results than rating scales — five specific pass/fail checks with individual reasoning, rather than one fuzzy number.
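A small sketch of the idea, under the assumption that a check is just a name, a pass/fail flag, and a message. The `Check` record here and both helper methods are hypothetical and exist only to show how per-criterion results turn into feedback an agent can act on.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical shapes for illustration; not the library's Check or Judgment. */
final class FeedbackSketch {

    record Check(String name, boolean passed, String message) {}

    /** Fraction of checks that passed; an empty list scores 0. */
    static double passFraction(List<Check> checks) {
        if (checks.isEmpty()) {
            return 0.0;
        }
        long passed = checks.stream().filter(Check::passed).count();
        return (double) passed / checks.size();
    }

    /** One bullet per failed check, suitable for pasting into a retry prompt. */
    static String feedbackFor(List<Check> checks) {
        return checks.stream()
                .filter(check -> !check.passed())
                .map(check -> "- " + check.name() + ": " + check.message())
                .collect(Collectors.joining("\n"));
    }
}
```

The number from `passFraction` is useful for dashboards, but the text from `feedbackFor` is what tells the agent which specific thing to fix on the next attempt.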
Why One Judge Isn't Enough
In the code coverage experiments I asked an agent to write JUnit tests for Spring Petclinic and increase coverage from 50% to 85%. Think about what a human would check when reviewing the agent's work:
- Did the project still compile and pass its tests?
- Did coverage actually improve?
- Are the tests any good — or did the agent game coverage with empty assertions?
Those are three different kinds of evaluation, and they have very different costs. The first is a ./mvnw test invocation — free, instant, deterministic. The second is parsing a JaCoCo report — still free and deterministic. The third requires judgment — an LLM reading the generated tests and evaluating whether they exercise meaningful behavior.
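For a sense of how cheap the second check is, here is a minimal sketch that reads the report-level LINE counter from a JaCoCo XML report using only the JDK's XML parser. It is not how CoverageImprovementJudge is implemented (the source for that is not shown here); the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

/** Illustrative helper; not part of agent-judge. */
final class CoverageSketch {

    /** Returns overall line coverage in percent from the text of a JaCoCo XML report. */
    static double lineCoveragePercent(String jacocoXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real JaCoCo reports declare a DOCTYPE; do not try to fetch report.dtd.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element report = builder
                    .parse(new ByteArrayInputStream(jacocoXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // Only direct children of <report> are totals; nested counters belong to
            // individual packages, classes, and methods.
            for (Node child = report.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element counter
                        && counter.getTagName().equals("counter")
                        && counter.getAttribute("type").equals("LINE")) {
                    long missed = Long.parseLong(counter.getAttribute("missed"));
                    long covered = Long.parseLong(counter.getAttribute("covered"));
                    long total = missed + covered;
                    return total == 0 ? 0.0 : 100.0 * covered / total;
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not read coverage report", e);
        }
        throw new IllegalArgumentException("No report-level LINE counter found");
    }
}
```

A deterministic coverage gate is then a comparison of this number against the baseline and the target.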
If the build is broken, there's no point checking coverage. If coverage didn't improve, there's no point asking an LLM to evaluate test quality. Each check gates the next. This is the experiment's actual jury configuration:
JuryFactory.builder()
    .addJudge(0, BuildSuccessJudge.maven("clean", "test", "jacoco:report"))
    .tierPolicy(0, TierPolicy.REJECT_ON_ANY_FAIL)
    .addJudge(1, new CoverageImprovementJudge(50.0, 85.0))
    .tierPolicy(1, TierPolicy.REJECT_ON_ANY_FAIL)
    .addJudge(2, new TestQualityJudge(agentClientFactory, judgePromptPath))
    .tierPolicy(2, TierPolicy.FINAL_TIER)
    .build();
Tier 0 runs the build. Tier 1 checks the JaCoCo report. Only if both pass does Tier 2 invoke an LLM to evaluate test quality. Or as I put it in How Does Your @Agent Flow?: gate early, gate often.
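The gating behavior itself is simple enough to sketch in a few lines. The version below is a deliberately stripped-down stand-in, assuming each tier is a single yes/no check; `Tier`, `Outcome`, and `run` are names invented for this example and say nothing about how CascadedJury is written internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Toy cascade for illustration; not the library's CascadedJury. */
final class CascadeSketch {

    /** A named tier whose check is only evaluated if every earlier tier passed. */
    record Tier(String name, BooleanSupplier check) {}

    /** Whether the work was accepted, and which tiers actually executed. */
    record Outcome(boolean accepted, List<String> tiersRun) {}

    static Outcome run(List<Tier> tiers) {
        List<String> ran = new ArrayList<>();
        for (Tier tier : tiers) {
            ran.add(tier.name());
            if (!tier.check().getAsBoolean()) {
                // Fail fast: later (typically costlier) tiers are never invoked.
                return new Outcome(false, List.copyOf(ran));
            }
        }
        return new Outcome(true, List.copyOf(ran));
    }
}
```

Because the checks are suppliers, an expensive LLM call placed in the last tier costs nothing on runs that are rejected earlier.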
Composing Judges
The code coverage example showed CascadedJury — sequential tiers with fail-fast policies. When you need judges to run in parallel instead, SimpleJury aggregates them with a voting strategy (majority, average, weighted average, median, or consensus). The Verdict carries both the aggregated result and every individual judgment keyed by judge name. See the Jury System docs for the full composition API.
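Two of those strategies are easy to sketch in isolation. The code below is an illustration under stated assumptions, not the library's MajorityVotingStrategy or WeightedAverageStrategy: abstentions are simply left out of the count, ties reject, and an all-abstain panel rejects. Those tie-breaking choices are mine.

```java
import java.util.List;

/** Illustrative voting helpers; the tie and all-abstain rules are assumptions. */
final class VotingSketch {

    enum Vote { PASS, FAIL, ABSTAIN }

    /** A score already normalized to [0, 1], with the weight of the judge that gave it. */
    record Weighted(double score, double weight) {}

    /** True when PASS votes strictly outnumber FAIL votes; ABSTAIN counts for neither side. */
    static boolean majorityPasses(List<Vote> votes) {
        long pass = votes.stream().filter(v -> v == Vote.PASS).count();
        long fail = votes.stream().filter(v -> v == Vote.FAIL).count();
        return pass > fail;
    }

    /** Weighted mean of normalized scores. */
    static double weightedAverage(List<Weighted> scores) {
        double totalWeight = scores.stream().mapToDouble(Weighted::weight).sum();
        if (totalWeight <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        double weightedSum = scores.stream().mapToDouble(s -> s.score() * s.weight()).sum();
        return weightedSum / totalWeight;
    }
}
```

For example, a build judge weighted 3 that scores 1.0 and a style judge weighted 1 that scores 0.5 average to 0.875, so the heavier judge dominates without silencing the lighter one.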
The Cascade in Action
Here's what this looks like on a real PR. Spring AI PR #5764 — a new health indicator for PgVector.
T0 — Build Judge: PASS. Rebase clean, no complex conflicts, build succeeds, tests pass. Four checks, all green. Cost: $0.
T1 — Version Pattern Judge: FAIL. javax.persistence import found in PgVectorStoreHealthAutoConfiguration.java. Should be jakarta.persistence in Boot 4. Cost: $0.
T2 — Quality Judge: In a strict cascade, T2 would be skipped — T1 already rejected. But in our PR review pipeline, T2 runs independently to provide feedback. The LLM found more: queryForObject() throws EmptyResultDataAccessException, not null — dead code at lines 48–50. Tests mocked behavior that never occurs in production. Code quality score: 0.55. Backport suitability: FAIL.
The deterministic judge caught a migration anti-pattern for free. In this pipeline, the LLM still runs for advisory feedback even after T1 rejects — and it found additional semantic bugs that unit tests missed. But the critical point: T1 alone was sufficient to block the merge, at zero cost.
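A check like the T1 finding can be approximated with a regular expression over source text. The sketch below is a simplified stand-in, not the Version Pattern Judge itself: it only looks at three namespaces that moved to jakarta.* (persistence, servlet, validation), chosen for illustration, and it would also flag import-like lines inside block comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative scanner; the namespace list is a small, non-exhaustive sample. */
final class ImportScanSketch {

    // Matches "import javax.persistence.X;" and friends at the start of a line.
    // Packages such as javax.sql or javax.xml stay in the JDK and must not be flagged.
    private static final Pattern LEGACY_IMPORT = Pattern.compile(
            "^\\s*import\\s+(?:static\\s+)?(javax\\.(?:persistence|servlet|validation)\\.[\\w.*]+)\\s*;",
            Pattern.MULTILINE);

    /** Returns every legacy import found in the given Java source, in order of appearance. */
    static List<String> legacyImports(String javaSource) {
        List<String> found = new ArrayList<>();
        Matcher matcher = LEGACY_IMPORT.matcher(javaSource);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }
}
```

An empty list means the check passes; a non-empty list doubles as the reasoning to hand back to the agent.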
Compare that to PR #5659 — custom StructuredOutputConverters. T0: FAIL — four compile errors after rebase. But the pipeline doesn't stop there. An AI agent diagnoses the API incompatibility, fixes the tests across three iterations (38 turns, $2.12), and T0 passes on retry. T1: PASS. T2: PASS with a quality score of 0.70. Three different outcomes from one cascade: reject, recover, approve.
In the code coverage experiments, where the cascade does short-circuit, about a third of runs fail at the deterministic tiers and never reach the LLM — evaluation for free.
Judges and Journals
A judge tells you what failed in the outcome. It can't tell you why the agent ended up there. A coverage judge reports that test coverage dropped 8 points, but did the agent forget to write tests, or did it spend its entire context window stuck in a retry loop fixing an unrelated import?
The answer is in the agent's behavioral trace — the journal. Every tool call, every file edit, every retry. The journal records the process; the judge evaluates the outcome. Markov analysis turns the journal into patterns — hotspots, loops, transitions that predict failure. Judges give you outcome data; journals give you process data. Together, they form the experiment loop that makes agents actually improvable.
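As a rough idea of what the first step of such an analysis might look like, the sketch below counts first-order transitions between consecutive journal events and estimates a transition probability. The journal is assumed to be a flat list of event names such as "edit" or "build"; that format and every name in the code are hypothetical, not the actual journal schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative first-order transition statistics over a hypothetical event list. */
final class JournalSketch {

    /** Counts each "from -> to" pair of consecutive events, in first-seen order. */
    static Map<String, Integer> transitionCounts(List<String> events) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < events.size(); i++) {
            counts.merge(events.get(i) + " -> " + events.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    /** Estimated probability that 'from' is followed by 'to'; 0 if 'from' never has a successor. */
    static double transitionProbability(List<String> events, String from, String to) {
        int fromTotal = 0;
        int hits = 0;
        for (int i = 0; i + 1 < events.size(); i++) {
            if (events.get(i).equals(from)) {
                fromTotal++;
                if (events.get(i + 1).equals(to)) {
                    hits++;
                }
            }
        }
        return fromTotal == 0 ? 0.0 : (double) hits / fromTotal;
    }
}
```

A high count on a pair like "build -> edit" that keeps recurring is the kind of loop a judge's verdict alone would never reveal.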
What Judges Unlock
Judges are not isolated prompts. They are typed, composable evaluation components inside deterministic workflows — the difference between "LLM-as-a-judge" as a technique and evaluation as production infrastructure.
Once you have that infrastructure, model selection becomes data-driven. Swap Claude for Gemini, run the same jury, compare verdicts. Same benchmark, cheaper model — if it passes the judges, ship it.
And once evaluation becomes infrastructure, a harder question follows: who judges the judges? The moment you rely on an evaluator to gate merges or score agent output, the evaluator itself becomes a critical system component — subject to drift, prompt sensitivity, and alignment mismatch. That's a topic for a future post.
Getting Started
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>agent-judge-core</artifactId>
    <version>0.11.0</version>
</dependency>
Start with the core, then add only the modules you need:
Judge families (what you evaluate with):
| Module | Dependencies | What it checks |
|---|---|---|
| agent-judge-core | None | Judge, Jury, Score, Verdict |
| agent-judge-exec | Process execution | Build success, shell commands, coverage |
| agent-judge-file | File I/O | AST comparison, POM diffs, XML, text |
| agent-judge-ai-core | agent-judge-core only | ModelBackedJudge, JudgeModel interface, prompt templates (framework-neutral) |
| agent-judge-llm | Spring AI | Spring AI implementation of JudgeModel (SpringAiJudgeModel) |
| agent-judge-rag | agent-judge-llm | Faithfulness, hallucination, contextual relevance |
Framework bridges (what you evaluate):
| Module | Adapts | One-liner |
|---|---|---|
| agent-judge-spring-ai | ChatResponse | SpringAiEvaluator.evaluate(goal, call, jury) |
| agent-judge-langchain4j | Result<T> | LangChain4jEvaluator.evaluate(goal, serviceCall, jury) |
| agent-judge-koog | AIAgent output | KoogEvaluator.evaluate(agent, input, jury) |
| agent-judge-agent-client | CLI agents | AgentClientEvaluator.evaluate(goal, workspace, call, jury) |
Each bridge is thin — 30–100 lines. The framework dependency is provided-scope: you bring your own version. agent-judge-core has zero runtime dependencies beyond java.base. See the Built-in Judges catalog for the full list.
Judge buildJudge = BuildSuccessJudge.maven();
JudgmentContext context = JudgmentContext.builder()
.goal("Migrate javax imports to jakarta")
.workspace(Path.of("./my-project"))
.build();
Judgment result = buildJudge.judge(context);
assertThat(result.status()).isEqualTo(JudgmentStatus.PASS);
Or use a framework bridge to evaluate in one line:
// Koog agent
Verdict verdict = KoogEvaluator.evaluate(agent, "Add a REST controller", jury);
// LangChain4j AI Service
Verdict verdict = LangChain4jEvaluator.evaluate(goal, assistant::chat, jury);
// Spring AI ChatClient
Verdict verdict = SpringAiEvaluator.evaluate(goal,
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    jury);
The agent-judge-tutorial project has working examples for everything in this post. Full documentation: Getting Started | Tutorial | API Reference. Source on GitHub.
Judges are tests for your agent. Start simple, add tiers as your agent matures.