The Expertise Paradox of Generative AI

One of the defining characteristics of expertise is not doing more work. It is doing less. A chess grandmaster does not evaluate more moves than a beginner. They evaluate fewer. Years of experience allow them to instantly discard bad branches and focus only on promising ones. The same pattern shows up across science, engineering, medicine, and decision-making.

Expertise is the compression of search. It is the ability to reach the same answer with dramatically less effort.

Which makes the current trajectory of Generative AI worth examining. Much of the frontier appears to be moving in the opposite direction. Context windows are getting longer. Reasoning traces are getting longer. Chains of thought are growing. Inference budgets are climbing. When a model generates ten thousand tokens before producing an answer, we often interpret that as evidence of deeper thinking.

But from the perspective of computer science, it can also look like brute-force search. Explore enough possibilities, and eventually you will find a good solution. A naive backtracking algorithm can do that too. The difference between a naive search algorithm and an intelligent one is not always whether they find the answer. It is how much work they require to get there.

Thinking Longer vs. Thinking Better

It is worth being precise about what this distinction looks like. Two solvers can land on the same answer through very different paths. One sweeps a wide region of the search space and gradually filters down. The other walks almost directly to the answer because most of the space has already been ruled out before search even started.

[Figure: Two ways to reach the same answer. Same destination, different amount of work. The token counts are illustrative, not measured.]

From the outside, both solvers produce the same answer. From the inside, one is brute force and the other is judgment. The visible artefact, the answer, hides the more important property, which is how much of the problem was eliminated without ever being touched.
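To make the contrast concrete, here is a minimal Python sketch. The toy problem (finding the integer whose square equals a given number), the function names, and the search limit are all invented for illustration, and step counts stand in for tokens only loosely. One solver sweeps candidates in order. The other uses the fact that squares grow monotonically to discard half of the remaining range at every step.

```python
def sweep(target_square, limit):
    """Brute force: try every candidate in order until one fits."""
    steps = 0
    for x in range(limit):
        steps += 1
        if x * x == target_square:
            return x, steps
    return None, steps


def bisect_root(target_square, limit):
    """Judgment: x * x is increasing, so each test rules out half the range."""
    lo, hi = 0, limit - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        square = mid * mid
        if square == target_square:
            return mid, steps
        if square < target_square:
            lo = mid + 1   # everything at or below mid is too small
        else:
            hi = mid - 1   # everything at or above mid is too large
    return None, steps


TARGET = 1111 * 1111   # hypothetical instance
LIMIT = 1_000_000      # hypothetical size of the search space

answer_a, work_a = sweep(TARGET, LIMIT)
answer_b, work_b = bisect_root(TARGET, LIMIT)
print("sweep:  answer", answer_a, "after", work_a, "candidates")
print("bisect: answer", answer_b, "after", work_b, "candidates")
```

Both functions return the same answer. The only thing that separates them is the second element of the result, which is exactly the part an outside observer never sees.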

So an uncomfortable question follows. Are today's large language models becoming more intelligent, or are they simply becoming more willing to spend computation?

What Operations Research Learned a Long Time Ago

This is not a new question. In Operations Research and Decision Science, the biggest breakthroughs almost never came from searching more. They came from searching less. The history of the field is, in many ways, a history of techniques for avoiding work.

Constraint propagation: eliminates impossible choices before search even begins, shrinking the space the solver ever has to look at.
Bounding functions: terminate hopeless branches early, so the solver never wastes time exploring regions that cannot beat the best-known solution.
Heuristics: guide exploration toward the most promising regions of the solution space, so the first paths tried are usually good ones.
Symmetry breaking: refuses to re-examine solutions that are structurally identical to ones already considered, collapsing whole families of branches into one.

The pattern is the same in every case. The goal was never to search harder. The goal was to search smarter. Brute force was the baseline that the field spent decades climbing away from.
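As a rough illustration of two of the techniques above, the sketch below solves a tiny 0/1 knapsack problem twice: once by enumerating every subset, and once with branch and bound. The item list, the capacity, and all function names are made up for this example. The bound used is the standard fractional relaxation, and the value-per-weight ordering plays the role of the heuristic. Note that the two counters are not measured in quite the same unit: the brute-force count is the number of complete assignments, while the branch-and-bound count is the number of search-tree nodes visited, including partial ones.

```python
from itertools import product

# Hypothetical instance: (value, weight) pairs and a weight capacity.
ITEMS = [(60, 10), (100, 20), (120, 30), (40, 15), (30, 5), (90, 25)]
CAPACITY = 50


def brute_force(items, capacity):
    """Enumerate every subset and keep the best feasible one."""
    best, examined = 0, 0
    for choice in product((0, 1), repeat=len(items)):
        examined += 1
        weight = sum(w for take, (_, w) in zip(choice, items) if take)
        if weight <= capacity:
            value = sum(v for take, (v, _) in zip(choice, items) if take)
            best = max(best, value)
    return best, examined


def branch_and_bound(items, capacity):
    """Depth-first search that prunes branches unable to beat the incumbent."""
    # Heuristic: consider the densest items (value per unit weight) first.
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    best, visited = 0, 0

    def upper_bound(i, value, room):
        # Fractional relaxation: fill greedily, allow a fraction of one item.
        for v, w in items[i:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    def visit(i, value, room):
        nonlocal best, visited
        visited += 1
        best = max(best, value)          # partial selections are feasible
        if i == len(items):
            return
        if upper_bound(i, value, room) <= best:
            return                       # bounding: this branch cannot win
        v, w = items[i]
        if w <= room:                    # skip choices that cannot fit
            visit(i + 1, value + v, room - w)
        visit(i + 1, value, room)

    visit(0, 0, capacity)
    return best, visited


print("brute force:      ", brute_force(ITEMS, CAPACITY))
print("branch and bound: ", branch_and_bound(ITEMS, CAPACITY))
```

Both solvers report the same optimal value. On an instance this small the saving is modest in absolute terms, but the exhaustive count doubles with every added item, while the pruned count depends on how tight the bound is.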

The difference between a naive search and an intelligent one is not whether they find the answer. It is how much they had to look at before they did.

Two Definitions of a Better Model

Today's discourse around Generative AI mostly rewards the first column of the table below. A larger context window, a longer chain of thought, a bigger reasoning budget all feel like progress because they are visibly more. The second column is less celebrated, partly because it is harder to advertise. You cannot demo a smaller token count the way you can demo a longer one.

Two definitions of a better model

Column 1. Today's Frontier: Scaling computation
Better = more thinking
Longer context. Longer reasoning traces. More chain-of-thought. More inference. The model is judged by how much it can afford to compute before answering.

Column 2. A Possible Next Step: Scaling intelligence
Better = less unnecessary thinking
Stronger priors. Sharper pruning. Tighter abstractions. The model reaches the same answer with far less work because most of the search space was eliminated before reasoning began.

Both columns improve quality. Only one of them looks like expertise.

The risk in optimising only the first column is that we end up with models that are better at affording computation, not better at avoiding it. They can climb higher because they can buy more steps. That is real progress. But it is not the same kind of progress as a chess grandmaster who simply does not see the bad moves.

What the Next Breakthrough Might Look Like

Perhaps Generative AI will eventually follow the same path the optimisation community followed. The next major breakthrough may not be a model that reasons for one hundred thousand tokens. It may be a model that arrives at the same answer using one thousand. Not because it knows less, but because it understands more.

The shape of that breakthrough is hard to predict, but the direction is recognisable. Better priors, so the model starts in the right neighbourhood. Stronger internal abstractions, so it can collapse many surface variations into a single concept. The ability to recognise, early, that a line of reasoning is going nowhere and stop. The willingness to commit to an answer when the evidence is already sufficient, instead of continuing to think for the sake of appearing thorough.
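The last of those habits, committing once the evidence is sufficient, has a simple analogue outside the model itself. The sketch below is an illustration under stated assumptions, not a description of how any existing system works: it imagines a fixed budget of independent attempts at a question, takes a plurality vote over their answers, and stops as soon as the leader can no longer be overtaken by the attempts that remain. The function name, the budget, and the sample answers are all hypothetical.

```python
from collections import Counter


def vote_with_early_stop(samples, budget):
    """Plurality vote that stops once the remaining attempts cannot change the winner."""
    if not samples or budget < 1:
        raise ValueError("need at least one sample and a positive budget")
    counts = Counter()
    for used, answer in enumerate(samples[:budget], start=1):
        counts[answer] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        # Even if every remaining attempt agreed with the runner-up
        # (or with a brand-new answer), it could not catch the leader.
        if lead > budget - used:
            return ranked[0][0], used
    return counts.most_common(1)[0][0], sum(counts.values())


# Hypothetical answers from nine independent attempts at the same question.
SAMPLES = ["42", "42", "17", "42", "42", "42", "17", "42", "42"]

answer, attempts_used = vote_with_early_stop(SAMPLES, budget=9)
print("answer:", answer, "attempts used:", attempts_used, "of", len(SAMPLES))
```

The stopping rule never changes which answer wins under the full budget. It only declines to pay for attempts whose outcome is already irrelevant.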

The benchmark worth watching

Most evaluations today measure capability per task. A more revealing axis would measure capability per token, or per second of inference, on tasks of matched difficulty. The model that wins on accuracy while using a fraction of the compute is the one quietly demonstrating expertise rather than effort.
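One way such an axis could be computed is sketched below. The record type, the metric names, and every number in the example are invented for illustration; none of them come from a real benchmark or a real model. The idea is simply to report, next to accuracy, how many tokens were spent for each correct answer.

```python
from dataclasses import dataclass


@dataclass
class Run:
    correct: bool   # did the model get this task right?
    tokens: int     # tokens generated before answering


def accuracy(runs):
    """Fraction of tasks answered correctly."""
    return sum(r.correct for r in runs) / len(runs)


def tokens_per_correct(runs):
    """Total tokens spent divided by the number of correct answers."""
    solved = sum(r.correct for r in runs)
    total = sum(r.tokens for r in runs)
    return total / solved if solved else float("inf")


# Made-up results for two hypothetical models on the same four tasks.
long_thinker = [Run(True, 10_000), Run(True, 12_000), Run(False, 15_000), Run(True, 9_000)]
short_thinker = [Run(True, 1_000), Run(True, 1_500), Run(False, 800), Run(True, 1_200)]

for name, runs in [("long", long_thinker), ("short", short_thinker)]:
    print(name, "accuracy:", accuracy(runs), "tokens per correct answer:", tokens_per_correct(runs))
```

In this invented example the two models tie on accuracy, so a leaderboard that reports accuracy alone cannot tell them apart. The second number can. Tokens spent on failed attempts are deliberately included in the total, since wasted effort is part of what the metric is meant to expose.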

The Quiet Inversion

The ultimate benchmark of intelligence has rarely been the ability to perform more computation. It has been the ability to avoid unnecessary computation altogether.

Today, model capability is largely measured by how much thinking a model can afford. Tomorrow, it may be measured by how much thinking it can avoid. That is a subtle inversion, but a real one. And it might be the moment when the field stops scaling computation and starts scaling intelligence.

✦

Expertise has always been a quiet thing. It rarely looks impressive in the moment. The grandmaster just plays the move. The doctor just names the diagnosis. The engineer just sees the bug.

None of them show the search that did not happen. That is the part that matters.