You Already Know How This Works
Before this became an artificial intelligence technique, it was how human knowledge had always travelled.
A scientist spends a decade producing a research paper. Everything is in there: the dead ends, the uncertainty, the caveats, the full complexity of what was learned. Then a textbook author reads that paper, and a hundred others, and distils them into a chapter that a university student can absorb in an afternoon. Depth is lost; accessibility is gained. The student who reads it cannot reproduce the original research. But she can reason from its conclusions, apply its principles, and pass the core insight on again, perhaps as a lesson to a secondary school class, where it becomes simpler still. A single idea, moving through layers of compression, each layer trading depth for reach.
This is not a loss. It is how knowledge scales. A discovery that lives only in its original paper reaches thousands. A discovery that survives distillation into textbooks, syllabi, and eventually into intuition reaches billions. The teacher does not become less valuable because students exist. The student does not become less real because she learned from a compression rather than the source.
There is, however, a more uncomfortable version of the same process. A student attends every lecture, takes meticulous notes, and accumulates them over a full year. She then uses those notes to teach a friend who never enrolled, never paid tuition, and sits no exams. The friend learns most of what the lectures contained. The university receives nothing. Nobody considers this outrageous at the scale of two people. Scale it to millions of students, automated note-taking, and systematic curriculum extraction, and it starts to feel different. Scale it further to a competing institution using those notes to build a rival course, and it becomes a legal and ethical problem that the original institution has almost no mechanism to address.
That is the logic of knowledge distillation as it seems to be playing out in artificial intelligence. The technique is ancient and legitimate. The economics are rational. The grey zone is wide. At industrial scale, with geopolitical stakes attached, it has become a serious point of tension in the industry.
What Knowledge Distillation Actually Is
The term sounds more arcane than it is. Knowledge distillation is the process of training a small model to mimic the behaviour of a large one. The large model is called the teacher. The small model is called the student. The student never sees the teacher's weights or architecture. It only sees the teacher's outputs on a carefully chosen set of inputs, and learns to reproduce those outputs on its own.
The result is a student model that captures most of the teacher's capability on a specific task distribution, at a fraction of the teacher's size and cost to run. The student is not a general intelligence. On a specific domain, it becomes a narrow expert, cheap to deploy and fast to serve, with its capability bounded by whatever it was trained on.
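In the classic formulation (Hinton and colleagues' soft-target loss), the student is trained to match the teacher's full output distribution, not just its top answer. Here is a minimal sketch, assuming access to the teacher's raw logits; note that black-box distillation through an API does not have logits, and fine-tunes on sampled text instead:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    A temperature above 1 exposes the teacher's 'dark knowledge':
    the relative probabilities it assigns to the wrong answers."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)
    return float(np.sum(p * np.log(p / q)))

# The student is penalised for diverging from the teacher's whole
# output distribution, not merely for picking a different top answer.
teacher = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.1]
far_student = [0.1, 3.0, 1.0]
assert distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student)
```

In practice, frontier-API distillation replaces this logit-matching loss with ordinary supervised fine-tuning on the teacher's sampled text outputs, but the underlying idea is the same: the student learns the teacher's behaviour, not its weights.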
This is not a controversial or exotic technique. It is standard engineering practice inside every major AI lab. OpenAI uses it to produce smaller, cheaper versions of GPT-4. Anthropic uses it to produce Claude Haiku from Claude Sonnet. Meta uses it throughout the Llama family. The technique is routine, legitimate, and commercially necessary. The controversy arises when the student is trained using a competitor's teacher without permission, at industrial scale, and with the intent to compete with it.
The Enterprise Logic: From Giant Model to Local Intelligence
Over time, large-scale Generative Artificial Intelligence (GenAI) deployment points naturally toward a version of distillation inside every large enterprise, one that has nothing to do with geopolitics. It follows directly from the economics of token consumption.
An enterprise sending millions of queries to a frontier model Application Programming Interface (API) begins to notice something: most of its queries are not novel. They cluster around a narrow distribution of domain-specific tasks. Contract clause extraction. Supply chain exception classification. Customer query routing. Invoice anomaly detection. The same patterns appear, day after day, at high volume.
Each of those queries costs the same whether it is the first time the model has seen it or the ten-thousandth. The pricing is linear with tokens consumed. There is no amortisation. The organisation pays in full, every time.
The rational economic response to linear token cost at high volume is to build a local model that handles routine queries cheaply, and reserve the giant model for genuinely novel requests. The giant model teaches the student. The student displaces the teacher over time.
The mechanism works in three phases. First, the organisation routes its real queries through the frontier API, logging both queries and responses. Second, once enough query-response pairs have accumulated on the organisation's specific task distribution, a smaller open-source model (Llama, Mistral, Phi, or similar) is fine-tuned on that dataset. Third, the fine-tuned local model handles routine queries. The frontier API is called only when the local model's confidence is low, or when the query falls outside the known distribution.
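The third phase can be sketched as a confidence-gated router. Everything named here is hypothetical: `local_model`, `frontier_api`, and the threshold value are stand-ins, and a real deployment would use calibrated confidence scores rather than a raw cut-off.

```python
# A minimal sketch of phase-three routing. `local_model`, `frontier_api`,
# and the threshold are hypothetical stand-ins, not real APIs.

CONFIDENCE_THRESHOLD = 0.85  # assumed cut-off for trusting the local model

def route(query, local_model, frontier_api, log):
    """Try the cheap fine-tuned model first; escalate to the frontier
    API only when its confidence is low. Escalated pairs are logged,
    becoming training data for the next fine-tuning round."""
    answer, confidence = local_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, "local"
    answer = frontier_api(query)
    log.append((query, answer))  # future fine-tuning data
    return answer, "frontier"

# Stub models standing in for the real thing.
def stub_local(query):
    if "invoice" in query:
        return "routine answer", 0.95  # well inside the known distribution
    return "uncertain guess", 0.40     # out-of-distribution query

def stub_frontier(query):
    return "frontier answer"

escalation_log = []
_, path_a = route("flag this invoice anomaly", stub_local, stub_frontier, escalation_log)
_, path_b = route("novel cross-border legal question", stub_local, stub_frontier, escalation_log)
# Routine queries stay local; novel ones escalate and get logged.
```

The feedback loop is the important design choice: every escalated query becomes a new query-response pair, so the local model's coverage of the organisation's task distribution grows with use, and the escalation rate falls over time.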
The economics of this shift are significant. Frontier model API calls are priced per token at rates that reflect the cost of running a very large model on expensive hardware with significant margins. A fine-tuned local model running on modest on-premises or reserved cloud infrastructure costs a fraction of that per query, with most of the cost being a one-time fine-tuning investment that amortises rapidly at enterprise query volumes.
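Under these conditions the payback arithmetic is simple enough to sketch. Every figure below is an illustrative assumption, not actual vendor pricing:

```python
# Illustrative break-even arithmetic. Every number here is an assumption
# made for the sake of the sketch, not a real price.
frontier_cost_per_query = 0.02    # $ per routine query via a frontier API
local_cost_per_query = 0.002      # $ per query on amortised local hardware
fine_tuning_investment = 50_000   # one-time $ cost: fine-tuning, setup, evaluation

saving_per_query = frontier_cost_per_query - local_cost_per_query
break_even_queries = fine_tuning_investment / saving_per_query

print(f"Break-even after {break_even_queries:,.0f} routine queries")
# At, say, 100,000 routine queries per day, the one-time investment
# pays back in about a month; after that, every routine query is
# roughly ten times cheaper.
```

The exact figures vary widely by workload, but the shape of the result does not: a fixed one-time cost against a per-query saving means the case for distillation strengthens monotonically with query volume.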
The open-source dimension accelerates this materially. The student model architecture is available for free. Llama 3, Mistral, Phi-4, and Qwen provide capable base models that organisations can fine-tune on their own query distributions without building a model from scratch. The frontier API provides the teacher. The open-source ecosystem provides the student. The enterprise provides the domain-specific query data. The combination is a one-time exit from recurring frontier inference costs.
As distillation tooling matures (Amazon Bedrock, which launched Anthropic distillation support in October 2025, now automates the fine-tuning pipeline entirely, generating synthetic training data without requiring manually crafted examples), the expertise required to execute this exit is dropping. Early adopters are enterprises with machine learning teams. Within a few years, this becomes accessible to organisations with minimal AI capability. At that point, frontier model labs face a structural ceiling: their best customers graduate out of recurring revenue precisely because they became heavy enough users to justify doing so.
The Grey Zone: Where Legitimate Practice Becomes Something Else
There is no clean line between legitimate cost optimisation and contested territory here. The grey zone is wide, and from what I can tell, the industry has not drawn it clearly.
Clearly legitimate: Using your own model's outputs to train smaller versions of the same model, or using a competitor's API within its terms of service for genuine product development. Anthropic's Bedrock distillation service explicitly enables this within sanctioned boundaries.
The grey zone: An organisation logs all frontier API responses over 18 months, then uses that dataset to fine-tune a local model for internal use. The terms of service prohibit training competing models on API outputs, but whether an internal, non-redistributed model counts as "competing" is untested. My guess is it probably does not, but this is genuinely unresolved.
Clearly adversarial: Tens of thousands of fraudulent accounts sending specially crafted prompts designed to extract specific capabilities systematically. This is what DeepSeek, Moonshot AI, and MiniMax are alleged to have done to Claude, and it is an explicit terms of service violation.
The middle category is where most enterprise AI deployments will eventually find themselves. As I understand it, the terms of service of both OpenAI and Anthropic prohibit using their models' outputs to train competing models. The operative word is "competing." An enterprise fine-tuning a local model for its own internal use, not for sale or redistribution, occupies ambiguous territory. I think enforcement against a domestic enterprise customer would be commercially self-defeating for the AI lab and legally uncertain regardless.
The practical reality is that the frontier labs know this grey zone exists, know their customers are approaching it, and are responding by building their own distillation products (Bedrock, Azure AI Foundry) to capture the value of the transition rather than lose it entirely. That is the commercially rational response: if your customers are going to distil anyway, sell them the distillation service.
The Chinese Distillation Campaigns: A Different Problem
What the Chinese labs have been doing is a categorically different activity that shares the same underlying technique. The distinction matters.
Enterprise distillation uses real operational queries from genuine business workflows. The dataset emerges organically from usage. The intention is cost reduction on the organisation's own tasks. The teacher model is used within its intended purpose before the distillation occurs.
Adversarial distillation, as alleged against DeepSeek, Moonshot AI, and MiniMax, is systematic extraction. The queries are not genuine business queries. They are carefully engineered prompts designed to elicit responses that reveal specific model capabilities, covering as much of the capability space as possible in the minimum number of queries. The goal is not cost reduction. It is capability theft: building a model that mimics the teacher across its full general intelligence, not a narrow task distribution.
The enforcement problem is structural. Terms of service bans cannot reach entities outside US jurisdiction, and geographic blocks are routinely circumvented via proxy services. The Frontier Model Forum intelligence-sharing alliance is the latest attempt to close this gap. Whether it holds is an open question.
The deeper problem is that there is no clean legal hook here. AI outputs are not copyrightable, and terms of service only work when you have jurisdiction. My reading is that the real leverage is political: export controls on the hardware needed to scale distilled models, and pressure at the government level rather than the contract level.
Why This Matters for the Economics of Frontier AI
Both versions of distillation, enterprise and adversarial, point toward the same structural outcome for frontier model labs: a narrowing of the addressable market over time.
Enterprise distillation erodes the recurring inference revenue from the lab's best customers. The customers who generate the most token volume are precisely the customers with the strongest economic incentive to distil their way out of that volume. The lab retains them as occasional customers for novel queries, but loses the bulk of their recurring spend.
Adversarial distillation by Chinese labs accelerates the availability of capable open-weight alternatives that further reduce the enterprise's dependence on frontier APIs. If DeepSeek R1 was substantially distilled from OpenAI and Anthropic models, then the Chinese labs used Western frontier compute to produce a free alternative that competes with the very models that trained it. The frontier labs paid for their own displacement.
The giant model is necessary to create the conditions for its own eclipse. It teaches the world how to think, then watches the world stop paying for lessons.
The frontier labs are not passive in this. The distillation-as-a-service move (Bedrock, Azure AI Foundry) captures some of the value. Continued investment in capability that is genuinely difficult to distil (long-context reasoning, multimodal understanding, real-time tool use in agentic settings) creates a moving frontier that keeps the teacher ahead of any student trained on yesterday's outputs. The labs that survive will be those whose general intelligence is genuinely harder to approximate than their narrow task performance.
But the underlying dynamic does not resolve. It intensifies. More capable open-source student models. Better distillation tooling. Lower expertise requirements. More enterprise query data available for fine-tuning. Each of these forces moves in one direction. My guess is the frontier model lab of 2030 will have a smaller share of the recurring inference market than it does today, even if it remains the undisputed frontier of general capability.
There is one counter-force I can imagine: if frontier models find architectures that substantially reduce inference costs, the economic case for distillation weakens considerably. That has not happened yet, and I am not sure when it will.
The other thing I keep thinking about is the business opportunity sitting quietly inside this dynamic. A company that makes this pipeline seamless — gathering the query data, curating it, learning online, and gradually switching routine queries to a local model without the enterprise having to manage any of it — seems like it could do very well. The technology is mostly there. What is missing is the product.
Knowledge distillation is not a threat from outside the system. It is the system working as designed, following economic logic to its natural conclusion. The giant model creates the student. The student displaces the teacher for routine work. The teacher moves to harder problems. That is how intelligence scales in every domain we know. The question is whether the financial architecture of the current frontier model industry was built for a world in which that cycle completes this quickly.
The honest answer is probably not.
Written with assistance from Claude.