If you are reading Richard Sutton and Andrew Barto’s classic book, Reinforcement Learning: An Introduction, Chapter 4 introduces a concept that sits at the very foundation of RL: Dynamic Programming (DP).
The name sounds intimidating, but the core idea is surprisingly intuitive. Let’s break it down first without drowning in equations, then circle back to the math.
The Intuition: Navigating the Grid
Imagine you are playing a board game on a simple grid. Your goal is to reach a treasure chest while avoiding traps. You have a “map” that tells you exactly what happens if you move Up, Down, Left, or Right from any square. In RL terms, having this perfect map means we have a perfect model of the environment.
Dynamic Programming is simply a method for using this perfect map to figure out the absolute best way to play the game. It does this in two main steps: Policy Evaluation and Policy Improvement.
Step 1: Policy Evaluation (Scoring the Board)
Imagine you have a basic strategy (a “policy”)—say, you decide to just move completely randomly. Policy Evaluation is the process of looking at the board and grading each square based on that strategy.
We look at a square and ask: “If I start here and follow my random strategy, how much reward will I eventually get?” Because we know the rules of the game (our map), we can calculate this by looking at the neighboring squares. A square’s score is updated based on the scores of the squares it can lead to. We sweep over the board, updating the scores again and again until they stop changing.
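The sweeping process just described can be sketched in a few lines of Python. The setup here is an assumption for illustration, matching the small gridworld from Chapter 4: a 4x4 board whose top-left and bottom-right corners are terminal goals, a -1 reward per move, and a completely random strategy.

```python
import numpy as np

# Iterative policy evaluation for the random policy -- a minimal sketch.
# Assumed setup: 4x4 grid, states 0-15, corners 0 and 15 are terminal,
# and every move costs -1 (undiscounted, gamma = 1).
def next_state(s, a):
    if s in (0, 15):
        return s                          # terminal squares absorb
    row, col = divmod(s, 4)
    if a == 0: row = max(0, row - 1)      # Up
    elif a == 1: col = min(3, col + 1)    # Right
    elif a == 2: row = min(3, row + 1)    # Down
    else: col = max(0, col - 1)           # Left
    return row * 4 + col

V = np.zeros(16)                          # initial scores: all zero
theta = 1e-8                              # stop when scores barely change
while True:
    delta = 0.0
    for s in range(1, 15):                # sweep the non-terminal squares
        v = V[s]
        # Average over the 4 equally likely moves: reward -1 plus the
        # current score of the square each move leads to.
        V[s] = sum(0.25 * (-1.0 + V[next_state(s, a)]) for a in range(4))
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break
print(V.reshape(4, 4))
```

Note that the sweep updates scores in place, so later squares in a sweep already see their neighbors' fresh values; this converges to the same answer as keeping two separate arrays.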
Step 2: Policy Improvement (Updating the Strategy)
Now that every square has an accurate score, we can easily improve our strategy.
If you are standing on a square, you simply look at the neighboring squares and point your arrow (your strategy) toward the one with the highest score. By doing this for every square, your strategy just got better!
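The improvement step is even shorter in code. This sketch assumes the same 4x4 gridworld (states 0-15, corners 0 and 15 terminal, -1 per move); the scores `V` below are a hand-built stand-in for an already-evaluated value function (minus the number of steps to the nearest goal), used only to keep the example self-contained.

```python
import numpy as np

# One-step lookahead improvement: point each square's arrow at the
# best-scoring neighbor. Grid conventions are assumptions for illustration.
def next_state(s, a):
    if s in (0, 15):
        return s
    row, col = divmod(s, 4)
    if a == 0: row = max(0, row - 1)      # Up
    elif a == 1: col = min(3, col + 1)    # Right
    elif a == 2: row = min(3, row + 1)    # Down
    else: col = max(0, col - 1)           # Left
    return row * 4 + col

# Stand-in scores: minus the step count to the nearest goal corner.
V = np.array([-min(r + c, (3 - r) + (3 - c))
              for r in range(4) for c in range(4)], dtype=float)

# For each square, the greedy arrow is the action whose neighbor scores best.
policy = {s: int(np.argmax([-1.0 + V[next_state(s, a)] for a in range(4)]))
          for s in range(1, 15)}
```

For example, the square just right of the top-left goal gets arrow 3 (Left), pointing straight at the goal.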
Putting it Together: Policy Iteration and Value Iteration
If we alternate between scoring the board (Evaluation) and updating our arrows (Improvement), we are doing Policy Iteration. Eventually, the arrows will lock into the perfect, optimal path.
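The full alternation is only a few more lines: score the board, update the arrows, and stop when the arrows no longer move. This sketch reuses the assumed 4x4 gridworld (states 0-15, terminal corners, -1 per move), but with a discount of gamma = 0.9 so that evaluating even a bad starting policy (one that walks into a wall forever) still converges.

```python
import numpy as np

# Policy iteration sketch. Gridworld conventions and gamma = 0.9 are
# assumptions for illustration.
GAMMA = 0.9

def next_state(s, a):
    if s in (0, 15):
        return s
    row, col = divmod(s, 4)
    if a == 0: row = max(0, row - 1)      # Up
    elif a == 1: col = min(3, col + 1)    # Right
    elif a == 2: row = min(3, row + 1)    # Down
    else: col = max(0, col - 1)           # Left
    return row * 4 + col

def evaluate(policy, theta=1e-6):
    # Score every square under the current (deterministic) policy.
    V = np.zeros(16)
    while True:
        delta = 0.0
        for s in range(1, 15):
            v = V[s]
            V[s] = -1.0 + GAMMA * V[next_state(s, policy[s])]
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

def improve(V):
    # Point every square's arrow at the best-scoring neighbor.
    return [int(np.argmax([-1.0 + GAMMA * V[next_state(s, a)]
                           for a in range(4)]))
            for s in range(16)]

policy = [0] * 16                          # start badly: always move Up
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:               # arrows stopped moving: optimal
        break
    policy = new_policy
```

In practice only a handful of evaluate-improve rounds are needed before the arrows lock in.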
A faster shortcut is Value Iteration. Instead of fully evaluating a bad policy before improving it, we aggressively combine the steps. We just sweep through the board, updating each square’s score by assuming we will always take the best possible action.
Let’s Code It! (Value Iteration in Python)
Here is a simple implementation of Value Iteration for a 4x4 Gridworld (where the top-left and bottom-right corners are the goals).
```python
import numpy as np

# 4x4 Grid, states 0-15. 0 and 15 are terminal (goals).
num_states = 16
rewards = np.full(num_states, -1)  # Every step costs -1
rewards[0] = 0
rewards[15] = 0

# Initialize values to 0
V = np.zeros(num_states)
gamma = 1.0   # Discount factor
theta = 1e-4  # Convergence threshold

def get_next_state(state, action):
    # Actions: 0=Up, 1=Right, 2=Down, 3=Left
    if state == 0 or state == 15:
        return state  # Terminal
    row, col = divmod(state, 4)
    if action == 0:
        row = max(0, row - 1)
    elif action == 1:
        col = min(3, col + 1)
    elif action == 2:
        row = min(3, row + 1)
    elif action == 3:
        col = max(0, col - 1)
    return row * 4 + col

# Value Iteration
while True:
    delta = 0
    for s in range(1, 15):  # Skip terminal states
        v = V[s]
        # Calculate value for all 4 actions and pick the max
        action_values = []
        for a in range(4):
            next_s = get_next_state(s, a)
            action_values.append(rewards[s] + gamma * V[next_s])
        V[s] = max(action_values)
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

print("Optimal Values:")
print(V.reshape(4, 4))
```
For the Math Enthusiasts: The Equations Behind the Magic
Now that we understand the intuition, let’s look at the formal mathematics from Chapter 4.
DP algorithms rely on the assumption that the environment is a finite Markov Decision Process (MDP). The state, action, and reward sets (\(\mathcal{S}, \mathcal{A}, \mathcal{R}\)) are finite, and the dynamics are governed by a known set of probabilities: \(p(s', r | s, a)\).
1. Policy Evaluation (The Bellman Expectation Equation)
To evaluate an arbitrary policy \(\pi\), we compute the state-value function \(v_\pi\). This is done iteratively using the Bellman equation for \(v_\pi\) as an update rule:
\[v_{k+1}(s) \doteq \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_k(s') \right]\]
\(k\) is the iteration step.
\(\gamma\) is the discount factor (how much we care about future rewards versus immediate ones).
The equation essentially states: The value of a state is the expected immediate reward plus the discounted value of the next state.
2. Policy Improvement
Given the true value function \(v_\pi\) for a policy, we can form a new, better policy \(\pi'\) by acting greedily with respect to \(v_\pi\):
\[\pi'(s) \doteq \arg\max_a \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]\]
3. Value Iteration (The Bellman Optimality Equation)
Instead of waiting for policy evaluation to fully converge, Value Iteration truncates the evaluation step. It directly applies the Bellman Optimality Equation as an update rule:
\[v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_k(s') \right]\]
By repeatedly forcing the value of a state to equal the value of the best available action, \(v_k\) converges to \(v_*\) (the optimal value function). Once \(v_*\) is found, the optimal policy \(\pi_*\) is simply the greedy policy derived from it.
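This fixed-point property is easy to check numerically. For the 4x4 gridworld, \(v_*\) is simply minus the number of steps to the nearest goal corner (an analytic shortcut assumed here); applying the max-backup to it should hand it back unchanged.

```python
import numpy as np

# Numerical check: v* is a fixed point of the Bellman optimality backup.
# Assumed setup: 4x4 grid, corners 0 and 15 terminal, -1 per move, gamma = 1.
def next_state(s, a):
    if s in (0, 15):
        return s
    row, col = divmod(s, 4)
    if a == 0: row = max(0, row - 1)      # Up
    elif a == 1: col = min(3, col + 1)    # Right
    elif a == 2: row = min(3, row + 1)    # Down
    else: col = max(0, col - 1)           # Left
    return row * 4 + col

# Analytic v*: minus the step count to the nearest goal corner.
V_star = np.array([-min(r + c, (3 - r) + (3 - c))
                   for r in range(4) for c in range(4)], dtype=float)

# The max-backup should reproduce v* state by state.
for s in range(1, 15):
    backup = max(-1.0 + V_star[next_state(s, a)] for a in range(4))
    assert abs(backup - V_star[s]) < 1e-12
print("v* is a fixed point of the Bellman optimality backup")
```

Intuitively, each non-terminal square always has a neighbor one step closer to a goal, so the best action loses one step of reward and gains one step of value, leaving \(v_*\) unchanged.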