import numpy as np

class SimpleGridWorldMDP:
    """A minimal grid-world MDP: states are (row, col) cells, actions move the
    agent one cell, and reaching the bottom-right corner yields +1 reward."""

    def __init__(self, size=4):
        self.size = size
        self.states = [(i, j) for i in range(size) for j in range(size)]
        self.actions = ['up', 'down', 'left', 'right']
        self.goal = (size-1, size-1)

    def get_next_state(self, state, action):
        # Deterministic transition function: move one cell, clipped to the grid.
        i, j = state
        if action == 'up':
            next_i = max(0, i-1)
            next_j = j
        elif action == 'down':
            next_i = min(self.size-1, i+1)
            next_j = j
        elif action == 'left':
            next_i = i
            next_j = max(0, j-1)
        else:  # right
            next_i = i
            next_j = min(self.size-1, j+1)
        return (next_i, next_j)

    def get_reward(self, state, action, next_state):
        if next_state == self.goal:
            return 1.0
        return -0.1  # small negative reward for each step
Introduction
Markov Decision Processes (MDPs) form the theoretical backbone of modern reinforcement learning, including deep reinforcement learning. In this post, we’ll explore how MDPs provide the mathematical framework for understanding complex decision-making problems and how they translate into practical deep RL applications.
What is a Markov Decision Process?
An MDP is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. The “Markov” property states that the future only depends on the current state, not the history of how we got there.
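In symbols, the Markov property says that conditioning on the entire history gives the same transition probabilities as conditioning on the current state and action alone:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)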
Formally, an MDP consists of the following components (summarized as a tuple below):
- A set of states S
- A set of actions A
- A transition function P(s' | s, a), the probability of landing in state s' after taking action a in state s
- A reward function R(s, a, s')
- A discount factor γ ∈ [0, 1) that weights future rewards
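Putting these together in standard notation (which the list above leaves implicit), an MDP is the tuple below, and the agent's objective is to maximize the expected discounted return G_t:

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}

The grid world above is a concrete instance: its states, actions, get_next_state, and get_reward play the roles of S, A, P, and R.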
The Role of MDPs in Deep Reinforcement Learning
Deep RL extends traditional RL by using deep neural networks to approximate key functions in the MDP framework. Let’s look at how this works in practice.
Value Function Approximation
In deep RL, we often use neural networks to approximate the value function. A Deep Q-Network (DQN), for example, takes a state as input and outputs an estimate of Q(s, a) for every available action:
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )

    def forward(self, x):
        return self.network(x)

# Example usage
state_dim = 4   # for a simple environment
action_dim = 2  # two possible actions
model = DQN(state_dim, action_dim)
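To connect this back to the MDP loop, here is a minimal sketch (not from the original post) of turning the network's Q-value estimates into actions with epsilon-greedy exploration; the placeholder state vector and epsilon value are illustrative.

import random

def select_action(model, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily on Q-values.
    if random.random() < epsilon:
        return random.randrange(action_dim)
    with torch.no_grad():
        q_values = model(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

action = select_action(model, [0.0, 0.0, 0.0, 0.0])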
Policy Gradient Methods
Policy gradient methods directly learn the policy π(a | s), a probability distribution over actions given the current state, rather than deriving it from a value function:
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)
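As a rough sketch of how such a network is trained (a REINFORCE-style update assuming a completed episode of states, actions, and rewards; the dimensions, learning rate, and discount factor are illustrative, not from the original post):

import torch
from torch.distributions import Categorical

policy = PolicyNetwork(state_dim=4, action_dim=2)  # illustrative dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    # Discounted returns, computed backwards through the episode.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    log_probs = Categorical(probs=policy(states)).log_prob(actions)

    # Gradient ascent on expected return: weight log-probabilities by returns.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()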
Practical Considerations
When implementing deep RL algorithms based on MDPs, several practical considerations come into play:
- State Representation: The state space must be represented in a way that neural networks can process effectively.
- Reward Design: The reward function should guide the agent toward desired behavior while maintaining the Markov property.
- Experience Replay: To break temporal correlations and improve sample efficiency, we store transitions in a replay buffer.
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        # Overwrite the oldest transition once the buffer is full.
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Draw a random batch and unzip it into per-field tuples.
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*batch)
        return state, action, reward, next_state, done
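To show how the pieces fit together, here is a sketch of a single DQN update drawn from the buffer, reusing the model defined earlier. It assumes states are already numeric vectors; gamma, the optimizer, and the batch size are illustrative choices, not from the original post.

import torch
import torch.nn.functional as F

gamma = 0.99
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def dqn_update(buffer, batch_size=32):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions actually taken.
    q_values = model(states).gather(1, actions).squeeze(1)

    # TD target: r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states.
    with torch.no_grad():
        target = rewards + gamma * model(next_states).max(1).values * (1 - dones)

    loss = F.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, DQN computes the TD target with a separate, slowly updated target network rather than the online network itself, which makes training noticeably more stable.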
Conclusion
MDPs provide the theoretical foundation for deep RL by formalizing the interaction between an agent and its environment. Understanding this framework is crucial for developing and implementing effective deep RL algorithms. As the field continues to advance, the principles of MDPs remain central to new developments in deep RL.