Lesson 06: Parallelisation — AI Agents Series

The Core Idea

Routing sends each input down one path. Parallelisation does the opposite: it takes one input, breaks it into pieces that don't depend on each other, and works every piece at the same time, then folds the pieces back into a single result. Three moves, always in this order: split the job, run the parts concurrently, merge the outputs.

A newsroom runs on exactly this rhythm. When a budget is announced, the politics reporter, the markets reporter, and the small-business columnist all start writing the moment the news breaks. None of them waits for the others, because their angles don't overlap. An editor then stitches the three pieces into one front page. The page is ready far sooner than if one reporter had written all three stories in turn, and it is richer for having three independent viewpoints on it.

That last line is the whole bargain. Run the three parts one after another and you wait for all three added together. Run them at once and you wait only for whichever finishes last. Speed is just the obvious win, though. Splitting a job across several agents also buys you independent viewpoints: three angles on the same subject, or three attempts at the same question. That variety often makes the final answer sturdier than any single pass could be.

The Golden Rule: Independence

Parallelisation only works when the parts genuinely don't need each other. If sub-task B can't begin until it has seen sub-task A's output, then A and B aren't parallel at all. B is simply waiting its turn, and running them "together" just means B idles until A is done. That is a chain, not a parallel split, and forcing it into parallel machinery buys nothing.

So before splitting, ask one question of every pair of parts: could this one start without that one's result? If yes for all of them, parallelise. If any part feeds another, keep those two in a chain and parallelise only around them. Real workflows mix both freely.
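
Once the dependencies are written down, the pairwise question can be answered mechanically. The sketch below is one way to do it, assuming Python 3.9 or later for the standard-library graphlib module. The DEPENDS_ON map and its task names are invented for illustration. It groups tasks into waves: everything inside a wave can run in parallel, and the waves themselves form a chain.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks whose output it needs.
DEPENDS_ON = {
    "extract_ingredients": set(),
    "nutrition": {"extract_ingredients"},
    "cost": {"extract_ingredients"},
    "difficulty": set(),
    "verdict": {"nutrition", "cost", "difficulty"},
}


def parallel_waves(depends_on):
    """Group tasks into waves: run each wave concurrently, the waves in order."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs already exist
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so its dependants unlock
    return waves


for number, wave in enumerate(parallel_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {wave}")
```

A job that comes back as a single wave is fully parallel. A job where every wave holds one task is a plain chain.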

Why Parallelise?

Sturdier answers

Several independent passes catch what one would miss: three reviewers each owning one aspect, or five attempts at one question settled by a vote. Diversity is the point, not a side effect.

Wall-clock speed

You wait for the slowest part, not the sum. Since a large language model (LLM) call spends almost all its time waiting on the network, overlapping the waits is close to free.
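
The "slowest part, not the sum" claim is easy to check without spending a token. This is a small sketch in which time.sleep stands in for the network wait of a real call. The latencies are made-up numbers, and fake_call is a hypothetical placeholder, not part of any SDK.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical latencies in seconds, standing in for three network-bound calls.
LATENCIES = [0.1, 0.2, 0.3]


def fake_call(seconds):
    """Stand-in for an LLM call: it only waits, as a network request would."""
    time.sleep(seconds)
    return seconds


def timed(run):
    """Return how long run() takes, in seconds."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def run_sequential():
    return [fake_call(s) for s in LATENCIES]


def run_parallel():
    with ThreadPoolExecutor(max_workers=len(LATENCIES)) as pool:
        return list(pool.map(fake_call, LATENCIES))


sequential = timed(run_sequential)  # roughly the sum of the latencies
parallel = timed(run_parallel)      # roughly the largest latency
print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

The exact figures vary from machine to machine, but the sequential run should land near the sum of the three waits and the parallel run near the longest one.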

Focus per part

Each agent sees only its slice, so its prompt stays short and pointed — the same specialisation win as routing, applied to the pieces of one job.

Splitting the Work

How you carve a job up depends on what kind of job it is. Three shapes cover almost everything. They differ in what gets split: the subject's facets, the input's bulk, or the attempt itself.

By aspect: one subject, many angles

When a single thing needs judging on several unrelated counts, give each count its own agent. Picture a recipe being vetted for a weeknight-meals blog: one agent weighs its nutrition, another its cost, a third its difficulty. None of those answers depends on the others, so all three run together, and each agent's prompt stays laser-focused on its own column. This is the shape the code below builds.

By volume: one big input, cut into chunks

When the input is too large to handle in one pass, slice it and run the same task on each slice. A backlog of two thousand customer messages can be split into twenty batches of a hundred, each batch tagged for sentiment by its own agent at the same time. The task never changes from chunk to chunk — only the data does. Watch the seams: a slice cut mid-sentence can lose its meaning, so divide on natural boundaries where you can.
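
Both kinds of cut can be sketched in a few lines. The helpers below are illustrative and their names are invented here: batched() splits a list between whole items, and chunk_paragraphs() packs whole paragraphs into size-limited chunks so that no slice ends mid-sentence. The sketch assumes paragraphs are separated by blank lines.

```python
def batched(items, size):
    """Split items into consecutive batches of at most `size` whole items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunk_paragraphs(text, max_chars):
    """Pack whole paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk
    rather than being cut in the middle.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical backlog: 2,000 placeholder messages.
messages = [f"message {n}" for n in range(2000)]
batches = batched(messages, 100)
print(len(batches), len(batches[0]))  # 20 100

print(chunk_paragraphs("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

Each batch or chunk can then be handed to its own worker, with the same prompt every time.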

By repetition: one question, many attempts

Sometimes you don't split the input at all. You run the same task several times over and use the spread. Crank the temperature and ask for a tagline five times to get five genuinely different options to choose from. Or ask a yes/no question several times and let the majority answer settle it, smoothing out the occasional odd response. The prompt stays fixed on every run. What changes is the luck of the draw.

Bringing It Back Together

Splitting is only half the pattern. The parts have to become one answer again. The merge step is where parallelisation actually earns its keep, and the right method depends on what the parts are. Four methods cover the ground, ranging from a one-line join to a separate LLM call:

Stitch them together

Just join the outputs in order — the twenty sentiment batches concatenated back into one tagged list. The simplest merge. It fits whenever the parts are pieces of one whole rather than rival answers to one question.
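
A stitch is only correct if the pieces come back in their original order, even when they finish out of order. In this minimal sketch, tag_batch is a hypothetical stand-in for the per-batch call, and the sleep is rigged so earlier batches finish last. The merge relies on Executor.map, which returns results in input order rather than completion order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tag_batch(batch):
    """Hypothetical stand-in for a per-batch LLM call; returns one tag per message."""
    time.sleep(0.01 * (5 - batch[0]))  # earlier batches finish later, on purpose
    return [f"msg{n}:neutral" for n in batch]


batches = [[0, 1], [2, 3], [4, 5]]

with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    # map() yields results in the order of `batches`, not in completion order
    tagged_batches = list(pool.map(tag_batch, batches))

# the stitch: flatten the per-batch lists back into one list
stitched = [tag for batch in tagged_batches for tag in batch]
print(stitched)
```

If results are collected as they complete instead, each one needs to carry its batch index so the stitch can sort before joining.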

Let them vote

When several agents answered the same question, count the answers and take the most common. A quick, code-only way to turn a handful of independent guesses into one more reliable one.

Score and pick

Generated several rival solutions? Rate each against fixed criteria and keep the best. Unlike voting, you're not tallying matches. You're judging quality and choosing a single winner.
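
In code, score-and-pick comes down to one scoring function and a max(). The candidates and the three-point rubric below are toy examples invented for illustration. A real rubric would encode whatever your fixed criteria are, and the scorer could just as well be another model call.

```python
# Hypothetical candidates, for example taglines from a by-repetition split.
CANDIDATES = [
    "Dinner in one pan",
    "Smoky chickpeas, thirty minutes, one pan to wash",
    "A recipe",
    "The ultimate life-changing chickpea experience you deserve tonight",
]


def score(tagline):
    """Toy rubric, invented for illustration: one point per criterion met."""
    words = tagline.lower().split()
    points = 0
    points += 4 <= len(words) <= 9  # long enough to say something, short enough to scan
    points += "pan" in words        # mentions the one-pan selling point
    points += any(w.startswith("chickpea") for w in words)  # names the main ingredient
    return points


def pick_best(candidates):
    """Score every candidate against the same rubric and keep the single winner."""
    return max(candidates, key=score)


print(pick_best(CANDIDATES))  # Smoky chickpeas, thirty minutes, one pan to wash
```

Every candidate is judged by the same function, which is what separates this merge from a vote: identical answers are never counted, only quality is compared.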

Hand them to a synthesiser

Give all the outputs to one more LLM whose only job is to blend them into one consistent piece, reconciling overlaps and smoothing the joins. The richest merge, and the one the code below uses.

Split and merge are a matched pair. The way you carve the job up dictates how you put it back. By volume usually ends in a stitch. By repetition ends in a vote (for one settled answer) or a score-and-pick (for the best of several). By aspect ends in a synthesiser, because the parts are different views that someone has to reconcile. Choose the two ends together — the same way a router's labels and routes are designed as a pair.

Parallelisation in Code

Below is the by-aspect shape end to end: three reviewers judge one recipe on independent counts, run together, then a synthesiser writes the verdict. The blocks form one complete program. Install the openai and python-dotenv packages (both are imported in the setup block), set OPENAI_API_KEY in your environment or in a .env file, and it runs as shown.

Start with the plumbing: a tiny chat() helper over the OpenAI SDK (a prompt in, the model's text reply out), and the one recipe every reviewer will judge.

setup: the chat() helper and the input

import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def chat(user_prompt, system_prompt="You are a helpful assistant.",
         max_tokens=200, temperature=None):
    """Send one prompt to the model and return its text reply."""
    kwargs = dict(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=max_tokens,
    )
    if temperature is not None:
        kwargs["temperature"] = temperature
    return client.chat.completions.create(**kwargs).choices[0].message.content


RECIPE = """One-Pan Smoked Paprika Chickpeas
Serves 4. Tin of chickpeas, one onion, three cloves garlic, a tin of chopped
tomatoes, smoked paprika, cumin, a handful of spinach, olive oil, crusty bread
to serve. Soften the onion and garlic, stir in the spices, add tomatoes and
chickpeas, simmer twenty minutes, fold through the spinach. Done in half an hour."""

Now the reviewers. Each is an ordinary function with one tight focus — nutrition knows nothing of cost, cost nothing of difficulty — which is exactly what makes them safe to run at the same time:

the independent reviewers

def nutrition_review(recipe):
    """Aspect 1: how healthy and balanced is the dish?"""
    return chat(
        user_prompt=f"Recipe:\n{recipe}",
        system_prompt=(
            "You are a nutrition reviewer. In two or three sentences, judge how "
            "healthy and balanced this dish is -- protein, vegetables, and any "
            "concerns. No preamble."
        ),
        max_tokens=160,
    )


def cost_review(recipe):
    """Aspect 2: how friendly is it to a weekly food budget?"""
    return chat(
        user_prompt=f"Recipe:\n{recipe}",
        system_prompt=(
            "You are a grocery-cost reviewer. In two or three sentences, judge "
            "how cheap this dish is to make and call out the priciest ingredient. "
            "No preamble."
        ),
        max_tokens=160,
    )


def difficulty_review(recipe):
    """Aspect 3: how much skill, time, and kit does it demand?"""
    return chat(
        user_prompt=f"Recipe:\n{recipe}",
        system_prompt=(
            "You are a cooking-difficulty reviewer. In two or three sentences, "
            "judge the skill, time, and equipment this dish needs, and whether a "
            "nervous beginner could manage it. No preamble."
        ),
        max_tokens=160,
    )


# collect all three in one place so the split can loop over them
REVIEWERS = {
    "nutrition": nutrition_review,
    "cost": cost_review,
    "difficulty": difficulty_review,
}

Now the split. Python's ThreadPoolExecutor launches every reviewer at once and waits for them all to come back. Threads are the right tool here precisely because each call spends its time waiting on the network, not doing arithmetic, so the three waits overlap instead of stacking up:

the split: run them concurrently

def review_in_parallel(recipe):
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        # submit every reviewer at once...
        futures = {aspect: pool.submit(fn, recipe) for aspect, fn in REVIEWERS.items()}
        # ...then collect each result, keyed by its aspect
        reports = {aspect: f.result() for aspect, f in futures.items()}
    return reports
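
One thing to be aware of in the function above: Future.result() re-raises any exception the reviewer raised, so a single failed call aborts the whole collection. The sketch below shows one possible variant that keeps the successful reports and records the failures separately. run_all and the two stand-in reviewers are hypothetical names, and whether a partial result is acceptable depends on your merge.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(tasks, arg):
    """Run every task concurrently; return (results, errors), each keyed by task name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, arg) for name, fn in tasks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()  # re-raises whatever the task raised
            except Exception as exc:
                errors[name] = exc  # keep the failure so the merge knows what is missing
    return results, errors


# Hypothetical stand-ins for the reviewers: one succeeds, one fails.
def ok_review(recipe):
    return f"fine: {recipe}"


def broken_review(recipe):
    raise TimeoutError("simulated network timeout")


results, errors = run_all({"nutrition": ok_review, "cost": broken_review}, "chickpeas")
print(results)  # {'nutrition': 'fine: chickpeas'}
print(errors)   # {'cost': TimeoutError('simulated network timeout')}
```

With the errors in hand, the caller can retry the missing aspect, or tell the synthesiser which report is absent instead of letting it guess.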

And the merge. The synthesiser is just one more chat() call, handed all three reports at once. Its job is to reconcile and summarise what the reviewers found, not to re-judge the recipe itself:

the merge: a synthesiser LLM

def synthesise(recipe, reports):
    bundle = "\n\n".join(f"[{aspect}]\n{text}" for aspect, text in reports.items())
    return chat(
        user_prompt=f"Recipe:\n{recipe}\n\nReviewer reports:\n{bundle}",
        system_prompt=(
            "You are a food editor. Three reviewers have each judged one aspect "
            "of a recipe. Combine their reports into a single verdict of about "
            "four sentences, ending with a one-line recommendation that starts "
            "with 'Verdict:'. Do not contradict the reviewers."
        ),
        max_tokens=260,
    )

Run it on a one-pan chickpea recipe and the three reviews land together, then the editor folds them into a verdict:

sample output

[nutrition] Well-balanced -- plant protein from chickpeas, fibre and
vitamins from spinach and tomatoes, healthy fats from olive oil. The
crusty bread adds refined carbs depending on how much you eat.

[cost] Very cheap to make; every ingredient is a store-cupboard staple.
The priciest item is the smoked paprika, especially a good imported one.

[difficulty] Simple, one pan, common utensils. A straightforward simmer
with little prep -- well within reach of a nervous beginner.

(three reviewers finished concurrently in 9.9s)

------------------------------------------------------------
  Synthesiser's verdict
------------------------------------------------------------
A nutritious, plant-based dish, generous with protein, fibre and
vitamins, and easy on the wallet -- the smoked paprika is the only
near-premium. It needs one pan and almost no technique, so a beginner
can manage it comfortably.
Verdict: A healthy, budget-friendly, low-effort pick for a busy weeknight.

Notice the timing line: three calls that would take roughly thirty seconds back-to-back returned in about ten. The merge waited only for the slowest reviewer.

The other merge: a vote

Swap the by-aspect split for a by-repetition split, and the synthesiser for a tally, and you get the voting style. Ask one yes/no question several times over, with the temperature turned up so the answers actually vary, then let plain code count the ballots. No LLM is needed for the merge at all: Counter does it.

repetition + majority vote

def beginner_friendly_vote(recipe, rounds=5):
    def ask(_):
        answer = chat(
            user_prompt=f"Recipe:\n{recipe}",
            system_prompt=(
                "Could a nervous first-time cook make this without help? "
                "Reply with exactly one word: YES or NO."
            ),
            max_tokens=3,
            temperature=1.0,   # spread the answers so the vote means something
        )
        return answer.strip().upper().rstrip(".")

    with ThreadPoolExecutor(max_workers=rounds) as pool:
        ballots = list(pool.map(ask, range(rounds)))
    winner, count = Counter(ballots).most_common(1)[0]
    return winner, count, ballots

sample output

Ballots: ['YES', 'YES', 'YES', 'YES', 'YES']
Majority: YES (5/5)

A clean sweep here, but the value shows on a borderline subject: where a single call might return a stray NO, four YESes against one NO still settle on YES, so one odd response cannot decide the outcome. That is the whole reason to spend five calls instead of one.
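
Two edge cases are worth guarding if the vote feeds anything important. A model can return something other than YES or NO, and an even split has no majority: Counter.most_common() would simply return whichever answer it saw first. The tally() helper below is a hypothetical stricter replacement for the last two lines of the function above. It is a sketch, not the only way to handle ties.

```python
from collections import Counter


def tally(ballots, allowed=("YES", "NO")):
    """Count only well-formed ballots; return (winner, counts), winner None on a tie."""
    counts = Counter(b for b in ballots if b in allowed)
    ranked = counts.most_common(2)
    if not ranked:
        return None, counts  # nothing usable came back
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, counts  # dead heat: let the caller decide what to do
    return ranked[0][0], counts


print(tally(["YES", "NO", "YES", "MAYBE", "YES"])[0])  # YES
print(tally(["YES", "NO", "IT DEPENDS"])[0])           # None
```

An odd number of rounds makes a tie impossible only as long as every ballot is valid. A winner of None is a signal to ask again or to escalate.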

One last piece ties it together: an entry point that runs both merges. Drop the blocks above into a single file in order, add this, and it runs end to end:

run it: the entry point

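if __name__ == "__main__":
    import time  # only needed for the timing line below

    start = time.perf_counter()
    reports = review_in_parallel(RECIPE)
    elapsed = time.perf_counter() - start
    for aspect, text in reports.items():
        print(f"[{aspect}] {text}\n")
    print(f"(three reviewers finished concurrently in {elapsed:.1f}s)\n")

    print("-" * 60)
    print("  Synthesiser's verdict")
    print("-" * 60)
    print(synthesise(RECIPE, reports))

    winner, count, ballots = beginner_friendly_vote(RECIPE)
    print(f"\nBallots: {ballots}")
    print(f"Majority: {winner} ({count}/{len(ballots)})")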

Where It Goes Wrong

The independence trap. Parallelisation quietly breaks when two parts you treated as independent really aren't. Suppose the cost reviewer needs a figure that the nutrition reviewer was meant to work out first. Run the two side by side and the cost reviewer starts before that figure exists, so it works from nothing and its report is wrong — yet no error is ever raised. The fix is to notice the dependency and put those two in a chain instead of in parallel. Whenever a parallel result looks oddly incomplete, check that the parts really did not need each other.

The merge is where bugs hide. Each part can be perfect and the whole still wrong if you stitch in the wrong order, double-count a vote, or let the synthesiser quietly overrule its own sources. Splitting is the easy half. Treat the merge as the step that actually needs the care, and the testing.

Parallel work is concurrent spend. Five calls at once cost the same tokens as five in a row. You just pay for them all in the same instant, so you can hit rate limits sooner. And when the split is by repetition, five attempts cost five times what a single attempt would. Parallelisation spends money and rate-limit headroom to buy time and quality, so make sure the part you split was actually worth splitting.
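
One simple guard against rate limits is to cap how many calls are in flight at once. With a thread pool, the cap is just max_workers: extra tasks queue until a worker frees up. In the sketch below, fake_call is a hypothetical stand-in that records the peak overlap, and the cap of 4 is an arbitrary illustrative number, not a recommendation for any particular provider.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

in_flight = 0
peak = 0
lock = threading.Lock()


def fake_call(n):
    """Stand-in for an LLM call that records how many calls overlap."""
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.05)  # pretend to wait on the network
    with lock:
        in_flight -= 1
    return n


def run_capped(items, max_in_flight):
    """Run fake_call over items with at most max_in_flight calls at any moment."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(fake_call, items))


results = run_capped(range(20), max_in_flight=4)
print(len(results), peak)  # 20 results; peak never exceeds the cap of 4
```

The cap trades back some wall-clock time: twenty calls through four workers take about five rounds of waiting instead of one.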

Lesson Recap

What You Now Know

The pattern: split one job into independent parts, run them concurrently, then merge the outputs into a single answer
The golden rule: parts must be independent. If one needs another's output, that pair belongs in a chain, not in parallel
Two payoffs: sturdier answers from independent viewpoints, and wall-clock speed (you wait for the slowest part, not the sum). Focus per part comes along for free
Three ways to split: by aspect (one subject, many angles), by volume (one big input cut into chunks), by repetition (one question, many attempts)
Four ways to merge: stitch together, let them vote, score and pick the best, or hand everything to a synthesiser LLM
Split and merge are a pair: by volume → stitch, by repetition → vote or score-and-pick, by aspect → synthesise. Design both ends together
Concurrency is cheap here: an LLM call is mostly waiting on the network, so a thread pool overlaps the waits with almost no extra overhead (the token bill stays the same)
Mind the merge and the meter: most bugs live in how parts recombine, and running parts at once spends their tokens at once
Relation to earlier patterns: chaining runs one thing through ordered steps, routing picks one path, and parallelisation runs many paths at once, then reconciles them