Dynamic Programming · Sequential Decision-Making · First introduced 27 Apr 2026

Bellman Equation

"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." — Richard Bellman, Dynamic Programming (1957)

The value of where you are equals the best immediate payoff you can get, plus the value of wherever that action leads — and applying that one equality consistently across every state is all of dynamic programming.

Why it’s needed

Every sequential optimisation problem has the same structure: you make a decision now, the world moves to a new state, then you decide again. Repeat for dozens, hundreds, or thousands of steps. The puzzle is that the right decision right now depends on every decision that follows, so you cannot reason about step 15 without knowing the entire tail of optimal decisions from step 16 onward.

There are two obvious ways to tackle this, and both fail at scale:

✗ ENUMERATE ALL PATHS

Try every possible sequence of decisions from the current state to the end of the horizon. Finds the global optimum but grows exponentially with the number of decisions and the branching factor. For a 20-over cricket innings with 10 bowlers to choose from each over, that is on the order of 10^20 possible sequences, far too many to enumerate.

OR
✗ GREEDY ONE-STEP

Choose the best available decision right now, ignoring future steps. Cheap to compute but routinely gives globally poor results. A cricket captain who plays the best available bowler each over independently will misallocate specialist bowlers and waste their limited quota at the wrong moment.

Both fail for the same reason: they conflate the decision at one step with the full consequence of that decision over all future steps. The Bellman equation breaks this coupling by encoding the future consequence as a number — the value of the next state — that can be computed once and reused across every decision that could lead there.

The hard problem is not choosing the best action at a single step. It is evaluating the long-run cost of each action without re-computing the entire future every time.


What it does

The Bellman equation introduces a value function (a number attached to each state that summarises the best cumulative reward achievable from that state onward) and a recursion (a rule that computes the value of any state from the values of the states it can transition into, without looking further).

Once the value function is known, the optimal decision at any state is immediate: pick the action that maximises immediate reward plus the (discounted) value of the resulting state. The entire burden of planning ahead is pre-computed into the value function.
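In code, reading a decision off a known value function is a single comparison per action. A minimal sketch in Python, assuming hypothetical actions, reward, and next_state helpers and a value table V indexed by state (none of these names come from a specific library):

    def greedy_action(s, actions, reward, next_state, V, gamma=1.0):
        # The optimal action once V is known: the a that maximises R(s, a) + gamma * V(s')
        return max(actions(s), key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])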

  1. Define states. Identify every quantity that determines which decisions are still available and what outcomes remain possible. In a cricket innings: current over, score gap, wickets remaining, which bowlers still have overs left.
  2. Define terminal values. At the end of the horizon (last over, final period, end of contract), assign a value to every possible terminal state. In cricket: win probability = 1 if the required run rate is achieved, 0 otherwise.
  3. Work backwards. Starting from the terminal states, compute the value of each preceding state using the Bellman recursion: best immediate reward plus discounted value of the next state. This finite-horizon sweep is called backward induction; its infinite-horizon counterpart, value iteration, is discussed below.
  4. Read off the policy. At any state, the optimal action is whichever maximises the right-hand side of the equation. No further computation is needed at decision time.

Why "discounted"? The discount factor (written γ, a number between 0 and 1) makes future rewards worth less than immediate ones. For finite-horizon problems (like a cricket innings or a fiscal quarter), you can set γ = 1 and the equation still works. For infinite-horizon problems (like an inventory policy that continues indefinitely), γ < 1 is required to prevent the sum of future rewards from growing without bound.

The key economy is this: each state's value is computed once. A state reachable by many different paths of past decisions is evaluated only by its own future, not by any of those paths. This is the property the algorithms literature calls optimal substructure (Bellman's own term was the principle of optimality), and it is what converts an exponential enumeration problem into a linear scan over states.
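The four-step recipe fits in a few lines of Python. The sketch below runs backward induction on a deliberately tiny, made-up deterministic problem; the states, rewards, transitions, and horizon are invented for illustration and have nothing to do with the cricket paper:

    # Backward induction on a toy finite-horizon problem (all numbers made up).
    HORIZON = 3
    ACTIONS = ["a", "b"]

    def reward(step, s, a):                  # immediate reward R(s, a)
        return {"a": 1.0, "b": 0.5}[a] + 0.1 * s

    def transition(step, s, a):              # deterministic next state s'
        return s + (1 if a == "a" else -1)

    def terminal_value(s):                   # step 2: values at the end of the horizon
        return float(s >= 0)

    gamma = 1.0                              # finite horizon, so no discounting needed
    V = {s: terminal_value(s) for s in range(-HORIZON, HORIZON + 1)}

    # Step 3: work backwards, computing V(s) = max_a [ R(s, a) + gamma * V(s') ].
    policy = {}
    for step in reversed(range(HORIZON)):
        V_next, V = V, {}
        for s in range(-step, step + 1):     # states reachable at this step
            value, action = max(
                (reward(step, s, a) + gamma * V_next[transition(step, s, a)], a)
                for a in ACTIONS
            )
            V[s], policy[(step, s)] = value, action

    # Step 4: read off the policy at decision time with no further computation.
    print(V[0], policy[(0, 0)])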


Core idea

The formal statement connects the pieces named above into a single equation. The symbols map directly to the plain-English vocabulary:

The Bellman Equation (deterministic, finite-horizon)
V(s) = max_{a ∈ A(s)} [ R(s, a) + γ · V(s′) ]

V(s) → value of being in state s (the number that summarises the best total reward from s onward)

a ∈ A(s) → any action available in state s

R(s, a) → immediate reward for taking action a in state s

γ → discount factor (how much future reward is worth relative to immediate reward)

s′ → the state reached after taking action a from state s

V(s′) → value of the state reached (already computed, because backward induction processed s′ before s)

In the plain-English vocabulary: the value of your current state equals the immediate payoff of the best action, plus the discounted value of wherever that action takes you. The max operator chooses the best action; the recursion handles all future consequences automatically.

For stochastic transitions (where action a leads to several possible next states with known probabilities), the equation becomes an expectation:

Bellman Equation (stochastic)
V(s) = max_{a} [ R(s, a) + γ · ∑_{s′} P(s′ | s, a) · V(s′) ]

In plain English: when the next state is uncertain, replace the single next-state value with the probability-weighted average of all possible next-state values. The T20 cricket paper approximates this expectation using Monte Carlo simulation (random sampling of 50,000 possible innings trajectories per state), because the full state space is too large to enumerate exactly.
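A sketch of that substitution in Python. The expectation over next states is replaced by an average over sampled ones; simulate_next_state and the value table V are hypothetical placeholders standing in for the paper's innings simulator and its estimated values:

    def q_value_mc(s, a, V, simulate_next_state, reward, gamma=1.0, n_samples=50_000):
        # Monte Carlo estimate of R(s, a) + gamma * E[ V(s') ]:
        # draw many next states s' ~ P(. | s, a) and average their values.
        total = 0.0
        for _ in range(n_samples):
            total += V[simulate_next_state(s, a)]
        return reward(s, a) + gamma * total / n_samples

    def best_action(s, actions, **bellman_args):
        # The outer max over actions from the stochastic Bellman equation.
        return max(actions, key=lambda a: q_value_mc(s, a, **bellman_args))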

[Figure: from the current state s with value V(s), action a₁ leads to a successor worth V(s′) + R(s, a₁) and action a₂ to one worth V(s′) + R(s, a₂); the max over actions assigns V(s) = maxₐ [ R(s, a) + γ·V(s′) ].]
The Bellman recursion: V(s) is the maximum over all actions of immediate reward R(s,a) plus discounted next-state value γV(s′). The value of the current state is self-consistently determined by the values of its successor states.

The self-consistency point. The equation defines V(s) in terms of V(s′), which itself satisfies the same equation. This circularity is resolved by backward induction: start from terminal states (where V is given directly) and work backwards, so that V(s′) is always already known when V(s) is computed. When states form a cycle (as in infinite-horizon problems), algorithms like value iteration start with an arbitrary guess for V and update it repeatedly until it converges to a fixed point that satisfies the equation everywhere simultaneously.
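A minimal value-iteration sketch for that cyclic case, in Python, on a made-up two-state problem (the states, actions, rewards, and transition probabilities are invented purely to show the update):

    # Value iteration: start from an arbitrary guess for V and apply the Bellman
    # update repeatedly until it stops changing.
    R = {  # R[s][a] = immediate reward (made-up numbers)
        "low":  {"wait": 0.0, "work": 1.0},
        "high": {"wait": 2.0, "work": 2.0},
    }
    P = {  # P[s][a] = {next state: probability}
        "low":  {"wait": {"low": 1.0}, "work": {"high": 0.7, "low": 0.3}},
        "high": {"wait": {"high": 1.0}, "work": {"high": 0.9, "low": 0.1}},
    }
    gamma = 0.9
    V = {s: 0.0 for s in P}                    # arbitrary starting guess

    while True:
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
                   for a in P[s])
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < 1e-9:   # reached the fixed point
            break
        V = V_new

    print(V)   # satisfies V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V(s') ]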


Concrete example: T20 cricket bowling plan
Scenario — sports analytics

The situation. A T20 cricket captain must decide which of ten available bowlers to use in each of 20 overs, subject to the constraint that no bowler can bowl more than four overs. A bowler's effectiveness depends on the game state: over number, current score gap, wickets remaining, and which bowlers still have quota. The sequential structure is exactly the Bellman setting.

Without the equation. The captain applies a greedy rule: use the best available bowler this over, treating each over independently. The specialist death-over bowler (best in overs 17 to 20) gets used in over 12 because conditions seem favourable, leaving overs 17 to 20 to weaker options. The plan looks locally optimal at every step and is globally wasteful.

With the equation. Define the state: (current over, score gap, wickets lost, bowlers with remaining quota). Terminal values: the win probability at the end of over 20, computed from the score gap and run rate required. Working backwards from over 20, each state's value is the best immediate payoff (win-probability contribution of the chosen bowler in this over) plus the value of the resulting state in the next over. Because the death-over specialist's quota appears explicitly in the state, the equation accounts for its depletion: using the specialist in over 12 reduces the value of states in overs 13 to 20, and that cost appears in the max over over-12 actions.

The T20 cricket paper does exactly this. It approximates the Bellman equation using 50,000 Monte Carlo innings trajectories per state to estimate win probabilities, then applies backward induction over the state space. The resulting optimal policy improved Mumbai Indians' win probability by 4.1 percentage points (52.4% to 56.5%) on a real 2026 Indian Premier League (IPL) match, compared to the greedy baseline.

The symbols, for the record. State s = (over, score gap, wickets, bowler-quota vector). Action a = identity of the bowler to assign this over. R(s, a) = estimated win-probability contribution of bowler a in state s (estimated from historical ball-by-ball data). V(s′) = value of the resulting state, computed recursively. The discount factor γ = 1 (no time preference within an innings).
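A skeleton of that recursion in Python, with the bowler-quota vector carried explicitly in the state. This is not the paper's implementation: win_prob_contribution, next_state_after, and terminal_win_prob are hypothetical placeholders for quantities the paper estimates from ball-by-ball data and Monte Carlo simulation, and the transition is written as if it were deterministic to keep the sketch short:

    from functools import lru_cache

    N_OVERS = 20            # overs in the innings
    N_BOWLERS = 10          # bowlers available
    MAX_QUOTA = 4           # no bowler may bowl more than four overs

    # Hypothetical placeholders: in the paper these come from historical data and
    # simulated trajectories; here they only fix the shape of the recursion.
    def win_prob_contribution(over, score_gap, wickets, bowler):
        return 0.0

    def next_state_after(over, score_gap, wickets, bowler):
        return score_gap, wickets

    def terminal_win_prob(score_gap, wickets):
        return 1.0 if score_gap <= 0 else 0.0

    @lru_cache(maxsize=None)                  # each distinct state is evaluated once
    def V(over, score_gap, wickets, quotas):  # quotas: tuple of overs left per bowler
        if over == N_OVERS:
            return terminal_win_prob(score_gap, wickets)
        best = float("-inf")
        for b, q in enumerate(quotas):
            if q == 0:                        # bowler b has used the full quota
                continue
            gap2, wkts2 = next_state_after(over, score_gap, wickets, b)
            remaining = quotas[:b] + (q - 1,) + quotas[b + 1:]
            best = max(best, win_prob_contribution(over, score_gap, wickets, b)
                             + V(over + 1, gap2, wkts2, remaining))  # gamma = 1
        return best

    # Example call from the start of the innings (hypothetical numbers):
    # print(V(0, 180, 0, (MAX_QUOTA,) * N_BOWLERS))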


Why the naïve alternatives fail
THE GREEDY RULE

At each step, choose the action with the best immediate payoff, ignoring future states entirely.

Catch: Works only if all actions are fully substitutable across time. The moment using a scarce resource now depletes it for a later step where it matters more, the greedy rule fails systematically. In the cricket example: 4.1 percentage points of win probability lost on a single match.

FORWARD SIMULATION (rollout)

From the current state, simulate many random or heuristic future trajectories, average the outcomes, and pick the action whose average is best.

Catch: The simulated future depends on the same greedy or heuristic policy used to generate it. The quality of the action evaluation is bounded by the quality of the simulation policy. Improving the policy requires re-running all simulations, which is expensive and circular.

FULL TREE SEARCH

Enumerate every possible future sequence, score each leaf, and propagate the best scores back to the current node.

Catch: The number of paths grows as branches^depth. For a 20-over innings with 10 bowler options per over, full search has on the order of 10^20 leaves. The Bellman equation reduces this to the number of distinct states (thousands, not quintillions) by computing each state's value once regardless of how many paths lead to it.


Why practitioners misread this
Misread 1 — dynamic programming means exhaustive enumeration

The tempting reading: dynamic programming (DP) tries all possible futures, so it is just a fancy name for brute-force search.

What is actually true: DP eliminates redundant computation precisely because the same state can be reached by many different paths, yet its value needs computing only once. This property is called optimal substructure. Without it DP does not apply, but when it does hold, the number of states (not the number of paths) determines the computation. In the cricket example, there are astronomically many possible bowler-assignment sequences but only thousands of distinct game states. The Bellman equation collapses path-counting into state-counting, which is the entire source of its efficiency.

Misread 2 — the Bellman equation is one specific algorithm

The tempting reading: papers that mention "Bellman" are using a particular algorithm that may or may not suit the problem at hand.

What is actually true: the Bellman equation is a mathematical consistency condition, not an algorithm. Value iteration, policy iteration, Q-learning, and A* search are all algorithms that exploit it in different ways, for different state-space sizes and different transition-model assumptions. A paper that uses value-function notation and backward induction is using the Bellman equation whether or not it names it explicitly. Identifying which computational method a paper uses (tabular DP, approximate DP, model-free reinforcement learning) is separate from recognising that all of them rest on the same underlying recursion.

Misread 3 — it only applies to finite, discrete state spaces

The tempting reading: the equation requires listing all states explicitly, so it only works when states are finite in number and discrete in structure.

What is actually true: the Bellman equation extends to continuous states (the Hamilton-Jacobi-Bellman partial differential equation, used in continuous control and robotics), infinite horizons (discounted-sum formulations, standard in economics and reinforcement learning), and stochastic transitions with continuous probability distributions (used directly in the T20 cricket paper via Monte Carlo approximation). Inventory optimisation, energy storage dispatch, and clinical trial dosing all use continuous or infinite-horizon Bellman formulations in production. The finite discrete case is the simplest to explain but is not the boundary of the technique.


Where this shows up in practice
Sports Analytics
Batting order and bowling plan in cricket, lineup construction in basketball, penalty kick strategy in football: any sport with repeated decisions over a finite sequence of states with uncertain outcomes.
Inventory & Supply Chain
Multi-period stock replenishment policies, where each period's order depends on current inventory level and uncertain future demand. The Bellman equation captures the trade-off between ordering now (higher immediate cost) and stocking out later (higher future cost).
Energy Dispatch
Charging and discharging decisions for battery storage over time, where the value of stored energy depends on uncertain future electricity prices. The Bellman value function encodes the option value of holding stored energy.
Reinforcement Learning
Q-learning, deep Q-networks, and policy gradient methods for scheduling, routing, and resource allocation are direct computational implementations of the Bellman equation over large or continuous state spaces where exact tabular computation is infeasible.
Finance
American option pricing (the right to exercise at any time), portfolio rebalancing over multiple periods, and optimal stopping problems all reduce to Bellman equations over stochastic state processes.
Healthcare
Adaptive clinical trial design (adjust dosing based on observed patient response), treatment sequencing for chronic disease, and resource allocation across patient cohorts over time.

The diagnostic question: can you define a state such that the optimal decision from that state depends only on the state, and not on the path taken to reach it? If yes, the Bellman equation applies.


The Bellman principle

Bellman's principle of optimality: the future only cares about where you are now, not how you got there. Once that is true, evaluating the past is waste; only the state value matters.


Related concepts