Recipe — Gate a prompt change in CI with agent.eval
You changed a system prompt. It looks better. Does it actually still answer the ten questions you care about, and does it still sound on-brand? The only honest answer is a regression suite that runs your agent against fixed cases and asserts on the outcome — and the natural home for that is the test runner you already have. agent.eval returns an assertion-friendly EvalReport; wrap it in a Vitest test and a prompt regression turns red in CI before it reaches a user.
The scenario
Section titled “The scenario”A docs Q&A agent answers product questions. The eval suite mixes three kinds of check:
- Factual cases with a known answer — graded by an
exact/containsscorer (cheap, deterministic). - Structural cases — graded by a
predicatescorer that asserts on the result shape (no tool errored, output parsed). - Subjective cases (tone, helpfulness) with no single right answer — graded by an LLM judge against a rubric.
The whole thing is one *.spec.ts that the build runs.
yarn add @warlock.js/ai @warlock.js/ai-openai @warlock.js/ai-anthropic @warlock.js/sealyarn add -D vitestCI needs the provider keys in the environment. Gate the suite so local runs without keys skip rather than fail.
The agent under test and the judge
Section titled “The agent under test and the judge”import { ai } from "@warlock.js/ai";import { OpenAISDK } from "@warlock.js/ai-openai";import { AnthropicSDK } from "@warlock.js/ai-anthropic";import { v } from "@warlock.js/seal";
const openai = new OpenAISDK({ apiKey: process.env.OPENAI_API_KEY! });const anthropic = new AnthropicSDK({ apiKey: process.env.ANTHROPIC_API_KEY! });
// The agent we're guarding against regressions.export const docsAgent = ai.agent({ name: "docs-qa", model: openai.model({ name: "gpt-4o-mini" }), systemPrompt: ai.systemPrompt() .persona("You answer questions about the Warlock framework.") .instruction("Be concise. If unsure, say so — never invent an API."),});
// A separate, name-bearing judge agent with a verdict-shaped output schema.// The judge runner reads { score, passed, reason } off result.data.const verdictSchema = v.object({ score: v.number().min(0).max(1), passed: v.boolean(), reason: v.string(),});
const judgeAgent = ai.agent({ name: "eval-judge", model: anthropic.model({ name: "claude-sonnet-4-5" }), output: verdictSchema, systemPrompt: ai.systemPrompt().persona( "You are a strict grader. Score answers 0..1 and return the JSON verdict.", ),});The Vitest suite
Section titled “The Vitest suite”import { describe, expect, it } from "vitest";
const hasKeys = !!process.env.OPENAI_API_KEY && !!process.env.ANTHROPIC_API_KEY;
describe.runIf(hasKeys)("docs-qa eval", () => { it("passes the factual + structural suite", async () => { const report = await docsAgent.eval({ cases: [ { name: "schema-builder", input: "What package provides the v schema builder?", expected: "@warlock.js/seal", scorers: [ai.eval.contains()], }, { name: "execute-never-throws", input: "Does agent.execute throw on a provider error?", expected: "no", scorers: [ai.eval.contains()], }, { name: "no-tool-errors", input: "Summarize how supervisors route between agents.", // A predicate scorer asserts on the result, not the text: // every child report node completed and the agent didn't error. scorers: [ ai.eval.predicate( (ctx) => ctx.result.error === undefined && ctx.result.report.children.every((child) => child.status === "completed"), ), ], }, ], // Suite-wide threshold for scorers that return only a numeric score. passThreshold: 0.5, onFailure: (caseResult) => console.error( `FAILED ${caseResult.case.name} (score ${caseResult.score.toFixed(2)})`, caseResult.scores.map((s) => s.reason).filter(Boolean), ), });
// `report.passed` is true only when EVERY case passed. expect(report.passed).toBe(true); expect(report.passRate).toBe(1); });
it("passes the subjective suite via an LLM judge", async () => { const report = await docsAgent.eval({ cases: [ { name: "tone-unsure", input: "What is the airspeed of an unladen swallow?" }, { name: "tone-helpful", input: "I'm new — where do I start?" }, ], // No per-case scorers → the judge grades every case. judge: { agent: judgeAgent, rubric: "Score 1.0 only if the answer is honest about uncertainty and never invents an API. Penalize confident hallucination heavily.", passThreshold: 0.7, }, });
expect(report.passed).toBe(true); expect(report.meanScore).toBeGreaterThanOrEqual(0.7); });});How scoring resolves
Section titled “How scoring resolves”For each case, the runner resolves scorers in this precedence: the case’s own scorers → the suite-level scorers → the suite judge. A case that can resolve none throws at author time — an eval that can’t score a case is a config bug, not a silent pass. A case passes only when the agent did not error AND every resolved scorer passes. report.passed is the AND across the whole suite, which is exactly what you expect(...).toBe(true).
The built-in scorers, all on ai.eval:
ai.eval.exact()— output equalsexpected(trimmed, case-insensitive; structured values compared by canonical JSON).ai.eval.contains()—expectedappears as a substring of the output.ai.eval.predicate(fn)— wrap any boolean assertion over the fullEvalScorerContext(ctx.result,ctx.output,ctx.text).ai.eval.judge(config)— LLM-as-judge against a rubric.
Production notes
Section titled “Production notes”evalnever throws on a case failure. Failures surface on the report (report.passed, per-casepassed) and throughonFailure— so a single broken case gives you the whole picture, not a stack trace on the first miss. The only author-time throw is the “no scorer resolvable” guard.- Cases run sequentially, in suite order — deliberate, since cases share the agent and may carry side effects. Don’t assume parallelism.
- Give the judge an output schema. The judge runner reads
{ score, passed, reason }fromresult.datafirst, falling back to parsing raw text. A schema-bearing judge (as above) is far more reliable than text parsing; a judge that errors or returns unparseable text scores the case0rather than crashing the suite. - Gate on keys, not on luck.
describe.runIf(hasKeys)keeps the suite green on a contributor’s machine with no API keys while still enforcing it in CI where the keys exist. onFailurethrows are swallowed so a logging bug can’t abort the run — log inside it, don’t depend on it propagating.
Related
Section titled “Related”- Basic agent — the agent shape under evaluation.
- Deterministic supervisor tests — fast, no-LLM tests for multi-agent flows.