Lesson 07: Evaluator-Optimiser

The Core Idea

Ask an agent for something that genuinely has to hold up — the kind of output where a plausible-but-flawed first draft is worse than none at all — and the result usually lands somewhere between promising and not-quite-right. The evaluator-optimiser pattern treats that first attempt as raw material rather than a finished answer. It puts two agents to work, one that produces the output and one that judges it, then loops them: produce, judge, improve, judge again, until the output meets an agreed standard.

A high-jump coach runs exactly this loop. The athlete takes a jump. The coach watches it against what good technique looks like and gives one precise instruction, "plant your foot a hand's width further from the bar", not a vague "jump better". The athlete jumps again with that single change in mind, and they keep going until the bar is cleared or the session ends. The athlete owns the jump, the coach owns the eye. The athlete improves, the coach never jumps.

The Two Roles

The Generator (the maker)

Takes the brief and produces a draft. On every later round it takes the brief, its own last draft, and the feedback, and produces a better one. It owns the work and nothing else.

The Evaluator (the critic)

Holds the standard. It reads the draft against a fixed set of criteria and returns one of two things: a pass, or a precise list of what to change. It owns the judgement and never touches the work itself.

Keep the maker and the critic apart. The temptation is to have one agent write and grade in a single prompt. It rarely works. An agent marking its own homework tends to wave it through, because the same reasoning that produced the draft also decides it is fine. Two separate prompts, ideally with the evaluator told to be sceptical, give you a critic that actually pushes back.

Why a Loop Beats One Shot

A single pass asks the model to get everything right at the same time: correct, complete, on-brief, and inside every constraint. That is a lot to juggle, and whatever it gets wrong ships unnoticed. Splitting the job into produce then check lets each step stay simple, and it turns a mistake into something the next round can fix rather than something the reader discovers. With every loop the draft moves closer to the bar, which is why the pattern shines on work where being roughly right is not good enough.

What Makes the Loop Work

The pattern lives or dies on three design choices. Get them right and the loop converges quickly. Get them wrong and it either passes weak work or never stops.

1. Criteria you can check

The evaluator is only as good as the standard it holds. A goal like "make it good" gives a vague verdict. Spell out concrete, checkable criteria instead: a word limit, the sections that must appear, a banned-jargon list, a factual-accuracy check. The clearer the bar, the sharper the critique.
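
Some criteria are concrete enough to check without a model at all. The following is a minimal sketch, not part of the lesson's program: check_hard_rules and BANNED_TERMS are hypothetical names, and "word" is assumed to mean a whitespace-separated token. Softer criteria such as tone or accuracy would still go to the evaluator prompt.

```python
import re

# Hypothetical banned list, mirroring the jargon examples used later in this lesson.
BANNED_TERMS = ["singularity", "spacetime", "gravitational", "density", "escape velocity"]


def check_hard_rules(draft, max_words=50, banned=BANNED_TERMS):
    """Return a list of failure messages; an empty list means every hard rule passed."""
    failures = []

    # Whitespace-separated tokens: a simple, repeatable definition of "word".
    word_count = len(draft.split())
    if word_count > max_words:
        failures.append(
            f"Cut {word_count - max_words} words: the draft has {word_count}, "
            f"the limit is {max_words}."
        )

    # Case-insensitive match on word boundaries, so a term is not matched
    # inside a longer word.
    for term in banned:
        if re.search(rf"\b{re.escape(term)}\b", draft, flags=re.IGNORECASE):
            failures.append(f"Replace the term '{term}' with plain language.")

    return failures
```

Each failure message already names the rule and the change to make, so it could be handed to the generator as feedback alongside the evaluator's own points.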

2. Feedback the maker can use

"Not good enough" tells the generator nothing. Good feedback names the failing criterion and says what to do: "cut it to under 50 words" beats "too long". The generator folds that, the brief, and its last draft into the next attempt.

3. A way to stop

Every loop needs an exit. Stop when the draft passes every criterion, or after a set number of rounds, or once each round barely improves on the last. Without a stop, a fussy evaluator loops on forever and quietly runs up the bill.
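
The three exits can be sketched as a single decision function. This is an illustration under assumptions: stop_reason is a hypothetical helper, and it presumes you have some numeric quality score per round (for example, the fraction of criteria met), which the text-only evaluator in this lesson does not produce by itself.

```python
def stop_reason(passed, round_no, max_rounds, scores, min_gain=0.02):
    """Return why the loop should stop, or None to keep going.

    scores holds one quality score per round so far (higher is better).
    """
    # Exit 1: the draft met every criterion.
    if passed:
        return "passed"
    # Exit 2: the round budget is spent.
    if round_no >= max_rounds:
        return "round limit"
    # Exit 3: diminishing returns. The latest round improved the score
    # by less than min_gain over the round before it.
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "plateau"
    return None
```

The order matters: a pass wins over everything, and the round limit is checked before the plateau test so the hard budget can never be skipped.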

Evaluator-Optimiser in Code

Below is the whole loop on one small task: write a short explainer about black holes for a twelve-year-old. The generator drafts it, the evaluator checks it against a rubric, and the loop refines until it passes. The blocks form one complete program. Install the openai and python-dotenv packages, set OPENAI_API_KEY in your environment or in a .env file, and it runs as written.

Start with the plumbing: a tiny chat() helper over the OpenAI SDK, the brief, and the rubric the evaluator will hold the draft to.

setup: the chat() helper, the brief, the criteria

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


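def chat(user_prompt, system_prompt="You are a helpful assistant.",
         max_tokens=300, temperature=None):
    """Send one system + user exchange and return the reply text."""
    kwargs = dict(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=max_tokens,
    )
    # Only pass temperature when the caller sets it, so the API default applies otherwise.
    if temperature is not None:
        kwargs["temperature"] = temperature
    reply = client.chat.completions.create(**kwargs)
    # content is optional in the SDK's types; fall back to "" so callers can safely .strip().
    return reply.choices[0].message.content or ""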


BRIEF = "Explain what a black hole is to a curious 12-year-old."

CRITERIA = """- 50 words or fewer.
- Includes one everyday analogy a child would recognise.
- Plain language only: no jargon such as 'singularity', 'spacetime',
  'gravitational', 'density', or 'escape velocity'.
- Friendly and accurate."""

The generator is two functions. generate makes the first draft from the brief alone. refine makes every draft after that, and it sees three things: the brief, its own previous draft, and the evaluator's feedback. Notice that the generator is never handed the full rubric. It does not need it, because the evaluator enforces the standard and passes back whichever points fail.

the generator: write, then rewrite

def generate(brief):
    return chat(
        user_prompt=brief,
        system_prompt="You are a science writer for children. Write a short, clear explanation.",
        max_tokens=200,
        temperature=0.7,
    )


def refine(brief, draft, feedback):
    return chat(
        user_prompt=(
            f"Task: {brief}\n\n"
            f"Your previous draft:\n{draft}\n\n"
            f"The editor's feedback:\n{feedback}\n\n"
            "Rewrite the explanation so it addresses every point."
        ),
        system_prompt="You are a science writer for children. Improve the draft using the feedback.",
        max_tokens=200,
        temperature=0.7,
    )

The evaluator is one function holding the rubric. It is told to reply with a bare PASS when every criterion is met, or REVISE plus a bullet per failing point. That fixed shape is what the loop reads to decide whether to stop. Its temperature is 0.0, because judging should be steady, not creative.

the evaluator: check against the rubric

def evaluate(brief, draft):
    return chat(
        user_prompt=f"Task: {brief}\n\nCriteria:\n{CRITERIA}\n\nDraft to assess:\n{draft}",
        system_prompt=(
            "You are a strict editor. Check the draft against every criterion. "
            "If all are met, reply with exactly 'PASS'. Otherwise reply 'REVISE' "
            "on the first line, then one short bullet per failing criterion "
            "saying exactly what to change."
        ),
        max_tokens=200,
        temperature=0.0,
    ).strip()
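
Because the reply has a fixed shape, it can also be parsed into structured data rather than passed around as raw text. A minimal sketch, assuming the PASS / REVISE format requested in the system prompt above; parse_verdict is a hypothetical helper, and it is deliberately stricter than a startswith check, so anything that is not a clean PASS counts as a failure.

```python
def parse_verdict(reply):
    """Split an evaluator reply into (passed, list_of_fixes).

    Anything that is not a clean PASS counts as a failure, so a malformed
    reply triggers another round instead of slipping through.
    """
    lines = [line.strip() for line in reply.strip().splitlines() if line.strip()]
    if not lines:
        return False, ["Evaluator returned an empty reply."]
    # A clean pass is the single word PASS, in any case, with an optional full stop.
    if len(lines) == 1 and lines[0].upper().rstrip(".") == "PASS":
        return True, []
    # Keep only bullet lines after the first line; strip the bullet marker.
    fixes = [line.lstrip("-* ").strip() for line in lines[1:] if line[0] in "-*"]
    return False, fixes or ["Evaluator reply was not in the expected format."]
```

Returning the fixes as a list makes it easy to log them, count them per round, or merge them with checks done in plain code.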

The loop ties them together. Generate once, then alternate evaluate and refine. Two things end it: a PASS, or hitting the round limit. That round limit is the safety net: the stopping condition that guarantees the loop is finite even if the draft never quite satisfies a picky evaluator. A draft returned at the limit is the last rewrite, which the evaluator has not seen, so treat it as unverified.

the loop: generate, critique, refine, stop

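def run(brief, max_rounds=3):
    draft = generate(brief)  # the first draft comes from the brief alone
    for round_no in range(1, max_rounds + 1):
        print(f"--- draft {round_no} ({len(draft.split())} words) ---\n{draft}\n")
        verdict = evaluate(brief, draft)
        # Exit 1: the evaluator's reply starts with PASS.
        if verdict.upper().startswith("PASS"):
            print(f"[evaluator] PASS on round {round_no}")
            return draft
        print(f"[evaluator] {verdict}\n")
        # Otherwise the REVISE reply becomes the feedback for the next draft.
        draft = refine(brief, draft, verdict)
    # Exit 2: the round limit. The last rewrite is returned unchecked.
    print(f"[stop] reached the {max_rounds}-round limit")
    return draft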


if __name__ == "__main__":
    final = run(BRIEF)
    print("\n=== FINAL ===\n" + final)

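Run it and watch the draft shrink to fit under the bar. Your own drafts will differ from run to run, because the generator samples at temperature 0.7. In the sample below it takes three rounds: the evaluator catches the length and the jargon, the generator fixes exactly those, and the loop stops the moment it passes.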

sample output

--- draft 1 (167 words) ---
A black hole is a place in space where gravity pulls so much that even
light can't get out... You can think of it like a giant space vacuum
cleaner... Imagine you press down hard in the middle of a trampoline...
that 'gravity well' is so deep that nothing nearby can escape...

[evaluator] REVISE
- The draft exceeds the 50 words limit. Condense the explanation.
- Remove the term 'gravity well' as it may be considered jargon.

--- draft 2 (54 words) ---
A black hole is a spot in space with extremely strong gravity that not
even light can escape from, making it invisible. It's like a powerful
vacuum cleaner or a deep dent on a trampoline where everything nearby
gets sucked in. We find these by noticing how nearby stars behave.

[evaluator] REVISE
- The draft exceeds the word limit. It should be 50 words or fewer.

--- draft 3 (37 words) ---
A black hole is a space area with such potent gravity that even light
can't escape, making it unseen. It acts like a super-strong vacuum,
pulling everything close in. We discover them by watching nearby stars.

[evaluator] PASS on round 3

Three drafts, each one closer to the standard, with a critic that never wrote a word of the explanation itself. That separation is the whole point.
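
To exercise the loop's control flow without spending tokens, one option is to pass the three steps in as functions and substitute stubs for the model. The sketch below is an assumption-laden variant, not the lesson's program: run_with and the fake_* functions are hypothetical, and the stub "drafts" are just word lists that get halved each round.

```python
def run_with(brief, generate, evaluate, refine, max_rounds=3):
    """Same control flow as run(), but the three steps are passed in as functions.

    Returns the final draft and the number of rounds that were run.
    """
    draft = generate(brief)
    for round_no in range(1, max_rounds + 1):
        verdict = evaluate(brief, draft)
        if verdict.upper().startswith("PASS"):
            return draft, round_no
        draft = refine(brief, draft, verdict)
    return draft, max_rounds


# Stubs standing in for the model calls.
def fake_generate(brief):
    return " ".join(["word"] * 40)  # a 40-word first draft


def fake_evaluate(brief, draft):
    if len(draft.split()) <= 10:
        return "PASS"
    return "REVISE\n- Cut it to 10 words or fewer."


def fake_refine(brief, draft, feedback):
    words = draft.split()
    return " ".join(words[: len(words) // 2])  # halve the draft each round


if __name__ == "__main__":
    final, rounds = run_with("any brief", fake_generate, fake_evaluate, fake_refine)
    print(rounds, len(final.split()))  # prints: 3 10
```

With the stubs in place, both exits can be tested cheaply: the default limit of three rounds ends on a PASS, while max_rounds=2 ends on the limit.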

Where It Goes Wrong

The agreeable evaluator. A critic that is too soft, or that is secretly the same agent as the maker, passes weak drafts on the first look. Then the loop adds cost and delivers nothing. Make the evaluator a separate, sceptical prompt with a concrete rubric, so a real flaw actually triggers a revision.

The loop that never ends. An evaluator that can always find one more nit will never return PASS. The round limit is the backstop, but it is worth watching for a subtler version: two rounds in a row with no real improvement. That flat stretch is your diminishing-returns signal, and a good place to stop early.
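
One rough way to spot that flat stretch is to count the failing points in each verdict and stop when the count stops falling. This is a sketch under assumptions: count_failures and has_plateaued are hypothetical helpers, the verdicts are assumed to follow the REVISE-plus-bullets shape used earlier, and a bullet count is only a proxy for quality.

```python
def count_failures(verdict):
    """Count bullet lines in a REVISE reply; a bare PASS has none."""
    return sum(1 for line in verdict.splitlines() if line.strip().startswith("-"))


def has_plateaued(verdicts, patience=2):
    """True when the failure count has not dropped over the last `patience` rounds."""
    counts = [count_failures(v) for v in verdicts]
    # Need one baseline round plus `patience` rounds to compare against it.
    if len(counts) < patience + 1:
        return False
    recent = counts[-(patience + 1):]
    # No improvement means each count is at least as high as the one before it.
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))
```

Calling has_plateaued on the list of verdicts collected so far, just before each refine, would let the loop exit early instead of running to the round limit.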

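Every round costs. The sample run above made six model calls instead of one: one to generate, three to evaluate, and two to refine. A run that hits the three-round limit without passing makes seven. The pattern buys quality with time and tokens, so spend it where quality is the point and skip it where a single decent pass would do.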
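
The call count follows directly from the loop's shape. A small sketch, assuming a loop like run() above (one generate call, one evaluate call per round, and one refine call after each round that fails); model_calls is a hypothetical helper for budgeting, not part of the lesson's program.

```python
def model_calls(rounds_run, passed):
    """Count model calls for a loop shaped like run() in this lesson."""
    evaluations = rounds_run
    # A passing round returns before refine is called; a failing round always refines.
    refinements = rounds_run - 1 if passed else rounds_run
    return 1 + evaluations + refinements  # the leading 1 is the initial generate


print(model_calls(3, passed=True))   # prints 6: the sample run, which passed on round 3
print(model_calls(3, passed=False))  # prints 7: the round limit was hit without a pass
print(model_calls(1, passed=True))   # prints 2: the first draft passed straight away
```

Multiplying the result by a typical token count per call gives a quick upper bound on what a given max_rounds setting can cost.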

When to Reach for It

Use the pattern when the output has to clear a high bar and a wrong answer is expensive: contract or financial wording, code that has to run, anything bound by hard constraints. Lean on it, too, when the brief carries rules a single pass tends to drop — such as a strict length, a required structure, or a list of things to avoid. Skip it for quick, low-stakes, or throwaway work, where one good pass is plenty and the extra rounds are not worth the wait or the spend.

Lesson Recap

What You Now Know

The pattern: pair a maker with a critic and loop them — produce then critique then refine — until the output meets an agreed standard
The two roles: the generator owns the work and rewrites it from feedback, the evaluator owns the judgement and never touches the work
Keep them apart: a single agent grading its own output tends to pass it, so make the evaluator a separate, sceptical prompt
Why loop: one pass has to get everything right at once, while produce-then-check keeps each step simple and turns mistakes into something the next round fixes
Three design choices: criteria you can actually check, feedback the maker can act on, and a clear way to stop
Stopping conditions: stop on a pass, on a round limit, or when each round barely improves on the last
The generator need not hold the rubric: the evaluator enforces the standard, which keeps the maker's prompt simple
Mind the cost: every round is more calls, so reserve the loop for high-stakes, constraint-heavy work