# Bad solution (overfitting):
x = [1000.0, 500.0, ...]
w = [0.005, 0.010, ...]
prediction = 1000*0.005 + 500*0.010 = 10.0
# Good solution (generalizes):
x = [0.92, 0.15, ...]
w = [0.89, 0.10, ...]
prediction = 0.92*0.89 + 0.15*0.10 = 0.834
Introduction
Imagine you’re scrolling through Netflix, trying to decide what to watch. Netflix shows you personalized recommendations - movies and shows it thinks you’ll love. How does it know? The answer lies in recommender systems, one of the most successful applications of machine learning in our daily lives.
In this guide, we’ll build two types of movie recommender systems from scratch:
- Collaborative Filtering - “People like you also enjoyed…”
- Content-Based Filtering - “Because you watched…”
We’ll start with intuitive explanations anyone can understand, then dive deep into the mathematics and implementation details.
The Recommendation Problem
The Challenge
You run a streaming service with:
- 10,000 movies
- 100,000 users
- Most users have rated only ~50 movies
The Big Question: How do you recommend the right movies to each user from the remaining 9,950+ unwatched movies?
The Data
Here’s what a typical ratings matrix looks like:
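| Movie | User1 | User2 | User3 | User4 | User5 |
|---|---|---|---|---|---|
| Toy Story | 5.0 | ? | 4.0 | ? | 5.0 |
| Inception | ? | 3.5 | ? | 4.5 | ? |
| Shrek | 4.0 | 5.0 | 2.0 | ? | 4.5 |
| The Matrix | ? | ? | 5.0 | 5.0 | ? |
| Frozen | 3.0 | 4.5 | ? | 2.0 | ? |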
Where:
- Numbers = ratings (1-5 stars)
- ? = movie not watched/rated (this is ~95% of the matrix!)
Goal: Fill in the ? marks with accurate predictions!
Collaborative Filtering
Intuitive Explanation
The Core Idea
Remember when you asked a friend for a movie recommendation and they said:
“Well, we both loved Inception and Interstellar, so you’ll probably love Tenet too!”
That’s collaborative filtering! It finds users with similar tastes and uses their ratings to make predictions.
A Simple Story
Let’s say we have three friends:
Alice:
- Loved: The Matrix (5⭐), Inception (5⭐)
- Hated: The Notebook (1⭐)
Bob:
- Loved: The Matrix (5⭐), Inception (4.5⭐)
- Hated: The Notebook (2⭐)
Charlie:
- Loved: The Notebook (5⭐), Titanic (5⭐)
- Hated: The Matrix (2⭐)
Now, Alice and Bob have very similar taste (both love sci-fi, hate romance).
If Bob watched “Blade Runner” and gave it 5⭐, we can predict Alice will probably rate it highly too!
The Magic Trick
But here’s where collaborative filtering gets really clever:
We don’t need to know that these movies are “sci-fi” or “romance”!
The algorithm discovers hidden patterns automatically:
- Pattern 1: “Sci-fi action lovers”
- Pattern 2: “Romance drama fans”
- Pattern 3: “Comedy enthusiasts”
- And many more subtle patterns…
Technical Deep Dive Collaborative Filtering
Mathematical Foundation
Collaborative filtering learns two sets of vectors:
For each movie i:
- Feature vector x^(i) ∈ ℝ^n (e.g., n=10 dimensions)
For each user j:
- Parameter vector w^(j) ∈ ℝ^n
- Bias term b^(j) ∈ ℝ
Prediction formula:
ŷ^(i,j) = w^(j) · x^(i) + b^(j)
Where · denotes the dot product (element-wise multiplication and sum).
What Do These Vectors Mean?
Think of x^(i) as the “DNA” of movie i:
x^(Inception) = [0.92, 0.15, 0.78, 0.05, 0.31, ...]
(entries, left to right: feature 1, feature 2, feature 3, feature 4, ...)
The algorithm might learn that:
- Feature 1 = “mind-bending complexity”
- Feature 2 = “romance level”
- Feature 3 = “action intensity”
- etc.
Think of w^(j) as user j’s “taste profile”:
w^(Alice) = [0.89, 0.10, 0.82, 0.08, 0.25, ...]
- 0.89: likes complex plots
- 0.10: dislikes romance films
- 0.82: likes action films
- 0.08: dislikes slow (feature 4)
- 0.25: neutral about feature 5
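As a minimal sketch of the prediction formula, the snippet below plugs the first five displayed entries of x^(Inception) and w^(Alice) into ŷ = w · x + b. The vectors in the text are truncated with "...", so only the visible entries are used, and the bias value `b_alice = 0.5` is a made-up number chosen purely for illustration.

```python
import numpy as np

# First five displayed entries of the example vectors (the rest are elided in the text)
x_inception = np.array([0.92, 0.15, 0.78, 0.05, 0.31])
w_alice = np.array([0.89, 0.10, 0.82, 0.08, 0.25])
b_alice = 0.5  # hypothetical bias, not taken from the text


def predict_rating(w: np.ndarray, x: np.ndarray, b: float) -> float:
    """Return the predicted rating w . x + b for one user/movie pair."""
    return float(np.dot(w, x) + b)


rating = predict_rating(w_alice, x_inception, b_alice)
print(f"Predicted rating: {rating:.4f}")
```

Features where both the movie score and the user weight are large (complexity, action) contribute most to the sum, which is exactly the "similar DNA, similar taste" intuition.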
The Cost Function
We want to minimize the squared prediction error across all known ratings:
J(X, W, b) = 1/2 ∑_{(i,j):r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j))²
+ λ/2 ∑_{j=1}^{n_u} ∑_{k=1}^{n} (w_k^(j))²
+ λ/2 ∑_{i=1}^{n_m} ∑_{k=1}^{n} (x_k^(i))²
Where:
- r(i,j) = 1 if user j rated movie i, 0 otherwise
- y^(i,j) = actual rating
- First term = prediction error
- Second & third terms = regularization (prevents overfitting)
- λ = regularization parameter
Why Regularization?
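Without regularization, the model could “cheat” by fitting the training ratings with extreme values, for example huge movie features (x = [1000.0, 500.0, ...]) paired with tiny user weights (w = [0.005, 0.010, ...]), instead of moderate values such as x = [0.92, 0.15, ...] and w = [0.89, 0.10, ...].
Both kinds of solution can fit the training data, but the second generalizes better to new data!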
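The effect of the regularization terms can be checked numerically. The sketch below computes the L2 penalty from the cost function, (λ/2)(∑x² + ∑w²), for the large-valued and the small-valued x/w pairs used in this guide's overfitting illustration. Only the two displayed entries of each vector are used, and λ = 1.0 is an arbitrary choice for the comparison.

```python
import numpy as np


def l2_penalty(x: np.ndarray, w: np.ndarray, lambda_reg: float) -> float:
    """Regularization part of the cost for one movie vector and one user vector."""
    return float((lambda_reg / 2) * (np.sum(x ** 2) + np.sum(w ** 2)))


# "Bad" solution: huge features compensated by tiny weights
x_large = np.array([1000.0, 500.0])
w_tiny = np.array([0.005, 0.010])

# "Good" solution: moderate values on both sides
x_moderate = np.array([0.92, 0.15])
w_moderate = np.array([0.89, 0.10])

lambda_reg = 1.0  # arbitrary value for illustration
penalty_large = l2_penalty(x_large, w_tiny, lambda_reg)
penalty_moderate = l2_penalty(x_moderate, w_moderate, lambda_reg)

print(f"Penalty, large-valued solution:    {penalty_large:.4f}")
print(f"Penalty, moderate-valued solution: {penalty_moderate:.4f}")
```

Because the penalty grows with the square of each entry, gradient descent is pushed away from the large-valued solution even when it fits the training ratings.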
Optimization with Gradient Descent
We use gradient descent to find optimal X, W, and b:
Repeat until convergence:
1. Compute gradients: ∂J/∂X, ∂J/∂W, ∂J/∂b
2. Update parameters:
X := X - α * ∂J/∂X
W := W - α * ∂J/∂W
b := b - α * ∂J/∂b
Where α is the learning rate.
The partial derivatives are:
∂J/∂x_k^(i) = ∑_{j:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j)) * w_k^(j) + λ*x_k^(i)
∂J/∂w_k^(j) = ∑_{i:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j)) * x_k^(i) + λ*w_k^(j)
∂J/∂b^(j) = ∑_{i:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j))
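One way to gain confidence in these derivatives is a numerical gradient check. The sketch below re-implements the cost and the three gradient formulas in vectorized form on a tiny random problem and compares them with central finite differences. All sizes, the seed, and the helper names (`cost`, `gradients`, `numerical_grad`) are assumptions made for this illustration.

```python
import numpy as np


def cost(X, W, b, Y, R, lam):
    """Squared error on rated entries plus L2 penalties on X and W."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(W ** 2))


def gradients(X, W, b, Y, R, lam):
    """Analytic gradients matching the partial derivatives in the text."""
    err = (X @ W.T + b - Y) * R
    return err @ W + lam * X, err.T @ X + lam * W, err.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
n_movies, n_users, n_features = 4, 3, 2
X = rng.normal(size=(n_movies, n_features))
W = rng.normal(size=(n_users, n_features))
b = rng.normal(size=(1, n_users))
Y = rng.integers(1, 6, size=(n_movies, n_users)).astype(float)
R = (rng.random((n_movies, n_users)) > 0.3).astype(float)  # 1 = rated
lam = 0.5


def numerical_grad(param, eps=1e-6):
    """Central-difference gradient of the cost w.r.t. one parameter array.

    The array is perturbed in place and restored after each evaluation.
    """
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        cost_plus = cost(X, W, b, Y, R, lam)
        param[idx] = original - eps
        cost_minus = cost(X, W, b, Y, R, lam)
        param[idx] = original
        grad[idx] = (cost_plus - cost_minus) / (2 * eps)
    return grad


grad_X, grad_W, grad_b = gradients(X, W, b, Y, R, lam)
max_diff = max(
    np.abs(grad_X - numerical_grad(X)).max(),
    np.abs(grad_W - numerical_grad(W)).max(),
    np.abs(grad_b - numerical_grad(b)).max(),
)
print(f"Largest analytic-vs-numerical difference: {max_diff:.2e}")
```

A difference close to floating-point noise suggests the formulas and their vectorized form agree; a large difference usually points to a missing mask R or a transposed matrix.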
Collaborative filtering - Full Implementation from Scratch
import numpy as np
import pandas as pd
from typing import Tuple, List
import matplotlib.pyplot as plt
class CollaborativeFilteringRecommender:
    """
    Matrix Factorization based Collaborative Filtering Recommender

    Learns latent features for movies and users to predict ratings.
    """

    def __init__(self, n_features: int = 10, learning_rate: float = 0.01,
                 lambda_reg: float = 0.1, n_iterations: int = 1000):
        """
        Initialize the recommender system.

        Parameters:
        -----------
        n_features : int
            Number of latent features to learn
        learning_rate : float
            Step size for gradient descent
        lambda_reg : float
            Regularization parameter
        n_iterations : int
            Number of training iterations
        """
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.lambda_reg = lambda_reg
        self.n_iterations = n_iterations
        # These will be learned
        self.X = None  # Movie features
        self.W = None  # User parameters
        self.b = None  # User biases
        # Store training history
        self.cost_history = []

    def _initialize_parameters(self, n_movies: int, n_users: int):
        """
        Initialize X, W, and b with small random values.

        Using small random values breaks symmetry and helps convergence.
        """
        np.random.seed(42)
        self.X = np.random.randn(n_movies, self.n_features) * 0.01
        self.W = np.random.randn(n_users, self.n_features) * 0.01
        self.b = np.zeros((1, n_users))

    def _compute_cost(self, Y: np.ndarray, R: np.ndarray) -> float:
        """
        Compute the collaborative filtering cost function.

        J = 1/2 * sum of squared errors + regularization

        Parameters:
        -----------
        Y : np.ndarray, shape (n_movies, n_users)
            Ratings matrix
        R : np.ndarray, shape (n_movies, n_users)
            Indicator matrix (1 if rated, 0 otherwise)

        Returns:
        --------
        cost : float
            The cost function value
        """
        # Predictions for all movie-user pairs
        predictions = np.dot(self.X, self.W.T) + self.b
        # Squared error (only for rated movies)
        errors = (predictions - Y) * R
        squared_error = 0.5 * np.sum(errors ** 2)
        # Regularization terms
        reg_X = (self.lambda_reg / 2) * np.sum(self.X ** 2)
        reg_W = (self.lambda_reg / 2) * np.sum(self.W ** 2)
        # Total cost
        cost = squared_error + reg_X + reg_W
        return cost

    def _compute_gradients(self, Y: np.ndarray, R: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Compute gradients of the cost function.

        Returns:
        --------
        grad_X : np.ndarray
            Gradient with respect to X
        grad_W : np.ndarray
            Gradient with respect to W
        grad_b : np.ndarray
            Gradient with respect to b
        """
        # Predictions
        predictions = np.dot(self.X, self.W.T) + self.b
        # Error term (zero wherever no rating exists)
        error = (predictions - Y) * R
        # Gradients
        grad_X = np.dot(error, self.W) + self.lambda_reg * self.X
        grad_W = np.dot(error.T, self.X) + self.lambda_reg * self.W
        grad_b = np.sum(error, axis=0, keepdims=True)
        return grad_X, grad_W, grad_b

    def fit(self, Y: np.ndarray, R: np.ndarray, verbose: bool = True):
        """
        Train the collaborative filtering model.

        Parameters:
        -----------
        Y : np.ndarray, shape (n_movies, n_users)
            Ratings matrix
        R : np.ndarray, shape (n_movies, n_users)
            Indicator matrix (1 if rated, 0 otherwise)
        verbose : bool
            Whether to print progress
        """
        n_movies, n_users = Y.shape
        # Initialize parameters and clear history from any earlier fit() call
        self._initialize_parameters(n_movies, n_users)
        self.cost_history = []
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Compute cost
            cost = self._compute_cost(Y, R)
            self.cost_history.append(cost)
            # Compute gradients
            grad_X, grad_W, grad_b = self._compute_gradients(Y, R)
            # Update parameters
            self.X -= self.learning_rate * grad_X
            self.W -= self.learning_rate * grad_W
            self.b -= self.learning_rate * grad_b
            # Print progress
            if verbose and (iteration % 100 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d}: Cost = {cost:.4f}")

    def predict(self, movie_idx: int = None, user_idx: int = None) -> np.ndarray:
        """
        Make predictions.

        Parameters:
        -----------
        movie_idx : int, optional
            Specific movie index to predict for
        user_idx : int, optional
            Specific user index to predict for

        Returns:
        --------
        predictions : np.ndarray
            Predicted ratings
        """
        # Compute all predictions
        all_predictions = np.dot(self.X, self.W.T) + self.b
        if movie_idx is not None and user_idx is not None:
            return all_predictions[movie_idx, user_idx]
        elif movie_idx is not None:
            return all_predictions[movie_idx, :]
        elif user_idx is not None:
            return all_predictions[:, user_idx]
        else:
            return all_predictions

    def recommend_for_user(self, user_idx: int, Y: np.ndarray,
                           n_recommendations: int = 10) -> List[Tuple[int, float]]:
        """
        Get top N movie recommendations for a user.

        Parameters:
        -----------
        user_idx : int
            User index
        Y : np.ndarray
            Original ratings matrix (to exclude already rated movies)
        n_recommendations : int
            Number of recommendations to return

        Returns:
        --------
        recommendations : List[Tuple[int, float]]
            List of (movie_idx, predicted_rating) tuples
        """
        # Get predictions for this user
        predictions = self.predict(user_idx=user_idx)
        # Find movies the user hasn't rated
        unrated_mask = (Y[:, user_idx] == 0)
        unrated_predictions = predictions.copy()
        unrated_predictions[~unrated_mask] = -np.inf
        # Get top N, capped at the number of unrated movies so that
        # already-rated movies are never returned
        n_top = min(n_recommendations, int(unrated_mask.sum()))
        top_indices = np.argsort(unrated_predictions)[::-1][:n_top]
        recommendations = [(idx, predictions[idx]) for idx in top_indices]
        return recommendations

    def plot_cost_history(self):
        """Plot the training cost history."""
        plt.figure(figsize=(10, 6))
        plt.plot(self.cost_history)
        plt.xlabel('Iteration')
        plt.ylabel('Cost')
        plt.title('Training Cost History')
        plt.grid(True)
        plt.show()

# Example Usage
# -------------
def create_sample_data():
    """
    Create a sample movie ratings dataset.

    Returns:
    --------
    Y : np.ndarray
        Ratings matrix
    R : np.ndarray
        Indicator matrix
    movie_names : List[str]
        List of movie names
    user_names : List[str]
        List of user names
    """
    # Sample movies
    movie_names = [
        "Toy Story (1995)",
        "Inception (2010)",
        "The Matrix (1999)",
        "Shrek (2001)",
        "The Notebook (2004)",
        "Star Wars (1977)",
        "Frozen (2013)",
        "The Dark Knight (2008)",
        "Titanic (1997)",
        "Avengers (2012)"
    ]
    user_names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
    # Ratings matrix (0 = not rated)
    Y = np.array([
        [5.0, 4.5, 4.0, 5.0, 4.0],  # Toy Story
        [0.0, 3.5, 5.0, 0.0, 4.5],  # Inception
        [0.0, 4.0, 5.0, 0.0, 5.0],  # The Matrix
        [4.0, 5.0, 2.0, 4.5, 3.5],  # Shrek
        [3.0, 0.0, 1.0, 4.5, 2.0],  # The Notebook
        [0.0, 5.0, 5.0, 0.0, 5.0],  # Star Wars
        [3.0, 4.5, 0.0, 5.0, 3.0],  # Frozen
        [0.0, 4.0, 5.0, 0.0, 4.5],  # The Dark Knight
        [2.0, 0.0, 1.0, 5.0, 2.5],  # Titanic
        [0.0, 3.5, 4.5, 0.0, 4.0],  # Avengers
    ])
    # Create indicator matrix
    R = (Y != 0).astype(int)
    return Y, R, movie_names, user_names

# Create sample data
Y, R, movie_names, user_names = create_sample_data()
print("Original Ratings Matrix:")
print("=" * 70)
df = pd.DataFrame(Y, index=movie_names, columns=user_names)
print(df)
print("\n")
# Create and train the model
model = CollaborativeFilteringRecommender(
    n_features=5,
    learning_rate=0.1,
    lambda_reg=1.0,
    n_iterations=1000
)
print("Training Collaborative Filtering Model...")
print("=" * 70)
model.fit(Y, R, verbose=True)
print("\n")
# Make predictions
predictions = model.predict()
print("Predicted Ratings Matrix:")
print("=" * 70)
df_pred = pd.DataFrame(
    np.round(predictions, 2),
    index=movie_names,
    columns=user_names
)
print(df_pred)
print("\n")
# Get recommendations for a specific user
user_idx = 0  # Alice
print(f"Top 5 Recommendations for {user_names[user_idx]}:")
print("=" * 70)
recommendations = model.recommend_for_user(user_idx, Y, n_recommendations=5)
for rank, (movie_idx, rating) in enumerate(recommendations, 1):
    print(f"{rank}. {movie_names[movie_idx]}: {rating:.2f} stars")
Original Ratings Matrix:
======================================================================
Alice Bob Charlie Diana Eve
Toy Story (1995) 5.0 4.5 4.0 5.0 4.0
Inception (2010) 0.0 3.5 5.0 0.0 4.5
The Matrix (1999) 0.0 4.0 5.0 0.0 5.0
Shrek (2001) 4.0 5.0 2.0 4.5 3.5
The Notebook (2004) 3.0 0.0 1.0 4.5 2.0
Star Wars (1977) 0.0 5.0 5.0 0.0 5.0
Frozen (2013) 3.0 4.5 0.0 5.0 3.0
The Dark Knight (2008) 0.0 4.0 5.0 0.0 4.5
Titanic (1997) 2.0 0.0 1.0 5.0 2.5
Avengers (2012) 0.0 3.5 4.5 0.0 4.0
Training Collaborative Filtering Model...
======================================================================
Iteration 0: Cost = 310.8793
Iteration 100: Cost = 8.1080
Iteration 200: Cost = 8.1074
Iteration 300: Cost = 8.1074
Iteration 400: Cost = 8.1073
Iteration 500: Cost = 8.1073
Iteration 600: Cost = 8.1073
Iteration 700: Cost = 8.1072
Iteration 800: Cost = 8.1071
Iteration 900: Cost = 8.1071
Iteration 999: Cost = 8.1070
Predicted Ratings Matrix:
======================================================================
Alice Bob Charlie Diana Eve
Toy Story (1995) 4.39 4.45 4.01 4.87 4.19
Inception (2010) 4.27 3.92 4.75 5.03 4.44
The Matrix (1999) 4.52 4.09 4.86 4.99 4.59
Shrek (2001) 3.73 4.78 2.46 4.74 3.32
The Notebook (2004) 2.92 4.69 1.34 4.73 2.54
Star Wars (1977) 4.76 4.39 4.73 4.91 4.65
Frozen (2013) 3.33 4.36 2.57 4.86 3.20
The Dark Knight (2008) 4.39 4.06 4.69 4.99 4.47
Titanic (1997) 2.63 4.44 1.39 4.80 2.46
Avengers (2012) 4.01 3.96 4.28 5.00 4.15
Top 5 Recommendations for Alice:
======================================================================
1. Star Wars (1977): 4.76 stars
2. The Matrix (1999): 4.52 stars
3. The Dark Knight (2008): 4.39 stars
4. Inception (2010): 4.27 stars
5. Avengers (2012): 4.01 stars
Understanding the Code
Key Components:
- Initialization (_initialize_parameters)
  - Uses small random values (×0.01) to break symmetry
  - Sets random seed for reproducibility
- Cost Function (_compute_cost)
  - Vectorized implementation for speed
  - Computes predictions: X @ W.T + b
  - Only counts error where R == 1
  - Adds regularization penalties
- Gradients (_compute_gradients)
  - Uses matrix operations for efficiency
  - Computes partial derivatives for all parameters
  - Includes regularization terms
- Training (fit)
  - Iteratively updates parameters
  - Tracks cost history
  - Uses batch gradient descent
- Prediction (predict)
  - Can predict for specific movie/user or all pairs
  - Simply computes: X @ W.T + b
- Recommendations (recommend_for_user)
  - Excludes already-rated movies
  - Sorts by predicted rating
  - Returns top N recommendations
Content-Based Filtering
Intuitive Explanation Content-Based Filtering
The Core Idea
Content-based filtering is like having a friend who says:
“You loved Inception, which is a sci-fi thriller with mind-bending plot. Try ‘Shutter Island’ - also a mind-bending thriller!”
Instead of comparing users, we compare items (movies) based on their features.
A Simple Story
Imagine you’re at a video store (remember those?) and you tell the clerk:
You: “I loved The Lord of the Rings!”
Clerk: “Great! That’s fantasy, epic adventure, directed by Peter Jackson. Here are some similar movies…”
- The Hobbit (same director, same genre) ✓
- Game of Thrones (fantasy, epic) ✓
- Star Wars (epic adventure) ✓
The clerk is using the content (genre, director, actors) to recommend.
How It Works
Step 1: Describe each movie with features
Inception:
- Genre: Sci-Fi, Action, Thriller
- Director: Christopher Nolan
- Lead Actor: Leonardo DiCaprio
- Year: 2010
- Rating: PG-13
- Keywords: dreams, reality, heist
Step 2: Convert to numbers (feature vector)
Inception = [
0.8, # sci-fi score
0.7, # action score
0.9, # thriller score
0.2, # romance score
0.1, # comedy score
]
Step 3: Build user profile from their history
If you watched and loved:
- Inception (sci-fi: 0.8, action: 0.7, thriller: 0.9)
- The Matrix (sci-fi: 0.9, action: 0.8, thriller: 0.6)
- Interstellar (sci-fi: 1.0, action: 0.3, thriller: 0.4)
Your profile (average):
You = [
0.9, # loves sci-fi!
0.6, # likes action
0.63, # likes thrillers
0.1, # doesn't like romance
0.05, # doesn't like comedy
]
Step 4: Find similar movies
Calculate similarity between your profile and all unwatched movies!
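Steps 3 and 4 can be sketched in a few lines. The snippet averages the sci-fi, action, and thriller scores of the three liked movies from Step 3 into a profile, then ranks two unwatched candidates by cosine similarity. The candidate names and their scores are hypothetical, invented only to show the ranking.

```python
import numpy as np

# (sci-fi, action, thriller) scores of the liked movies from Step 3
liked = np.array([
    [0.8, 0.7, 0.9],  # Inception
    [0.9, 0.8, 0.6],  # The Matrix
    [1.0, 0.3, 0.4],  # Interstellar
])
profile = liked.mean(axis=0)  # simple average profile

# Hypothetical unwatched movies with made-up feature scores
candidates = {
    "candidate_scifi_thriller": np.array([0.9, 0.6, 0.7]),
    "candidate_light_drama": np.array([0.0, 0.2, 0.1]),
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


ranked = sorted(
    ((name, cosine(profile, vec)) for name, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The candidate whose feature vector points in nearly the same direction as the profile comes out on top, regardless of how large its individual scores are.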
Technical Deep Dive Content-Based Filtering
Feature Extraction
For content-based filtering, we need to extract features from movies. Common approaches:
1. Manual Features:
features = {
    'genre_scifi': 1 if 'Sci-Fi' in genres else 0,
    'genre_action': 1 if 'Action' in genres else 0,
    'director_nolan': 1 if director == 'Christopher Nolan' else 0,
    'year': (year - 1900) / 100,  # Normalized
    'runtime': runtime / 200,  # Normalized
    # ... many more
}
2. TF-IDF (Term Frequency-Inverse Document Frequency):
For text features (plot summaries, keywords):
TF-IDF(term, document) = TF(term, document) × IDF(term)
TF(term, doc) = (# times term appears in doc) / (# total terms in doc)
IDF(term) = log(N / (# documents containing term))
This gives higher weight to unique, descriptive words.
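To make the formulas concrete, here is a hand-rolled sketch that applies them to a three-document toy corpus. The documents are shortened keyword strings in the spirit of this guide's examples, and the helper names `tf`, `idf`, and `tf_idf` are chosen for this illustration. Note that scikit-learn's `TfidfVectorizer`, used later, applies a smoothed IDF and L2 normalization by default, so its numbers will differ from this plain version.

```python
import math

# Toy corpus: movie title -> list of keyword tokens (shortened for illustration)
corpus = {
    "Inception": "dreams reality heist".split(),
    "The Matrix": "virtual reality rebellion".split(),
    "Titanic": "ship disaster love".split(),
}


def tf(term: str, tokens: list) -> float:
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)


def idf(term: str, docs: dict) -> float:
    """Inverse document frequency: log(N / number of documents containing term)."""
    n_containing = sum(term in tokens for tokens in docs.values())
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(docs) / n_containing)


def tf_idf(term: str, title: str, docs: dict) -> float:
    """TF-IDF weight of `term` in the document named `title`."""
    return tf(term, docs[title]) * idf(term, docs)


heist_weight = tf_idf("heist", "Inception", corpus)
reality_weight = tf_idf("reality", "Inception", corpus)
print(f"heist:   {heist_weight:.4f}")
print(f"reality: {reality_weight:.4f}")
```

"heist" occurs in only one document while "reality" occurs in two, so "heist" receives the higher weight even though both appear once in the Inception keywords.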
3. Embeddings:
Modern approaches use neural networks to learn dense representations:
- Word2Vec for text
- Image embeddings for posters
- Audio embeddings for soundtracks
Similarity Metrics
1. Cosine Similarity (Most Common)
similarity(A, B) = (A · B) / (||A|| × ||B||)
= cos(θ)
Where:
- A · B = dot product
- ||A|| = L2 norm (magnitude) of vector A
- θ = angle between vectors
Properties:
- Range: [-1, 1] (usually [0, 1] for non-negative features)
- 1 = identical direction
- 0 = orthogonal (no similarity)
- -1 = opposite direction
Example:
import numpy as np

movie_A = np.array([0.9, 0.8, 0.1, 0.2])
movie_B = np.array([0.8, 0.7, 0.2, 0.3])
# Cosine similarity
similarity = np.dot(movie_A, movie_B) / (
    np.linalg.norm(movie_A) * np.linalg.norm(movie_B)
)
# similarity ≈ 0.99 (very similar!)
2. Euclidean Distance
distance(A, B) = sqrt(∑(A_i - B_i)²)
Smaller distance = more similar.
3. Pearson Correlation
correlation(A, B) = cov(A, B) / (σ_A × σ_B)
Measures linear relationship between vectors.
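For comparison with cosine similarity, the sketch below computes the Euclidean distance and the Pearson correlation for the same two example vectors, movie_A and movie_B. The Pearson value is computed once from the covariance formula and once with `np.corrcoef` as a cross-check. This is only an illustration on a four-element vector, not a recommendation of one metric over another.

```python
import numpy as np

movie_A = np.array([0.9, 0.8, 0.1, 0.2])
movie_B = np.array([0.8, 0.7, 0.2, 0.3])

# Euclidean distance: square root of the summed squared differences
distance = float(np.sqrt(np.sum((movie_A - movie_B) ** 2)))

# Pearson correlation from the formula cov(A, B) / (sigma_A * sigma_B)
dev_A = movie_A - movie_A.mean()
dev_B = movie_B - movie_B.mean()
pearson_manual = float(
    np.sum(dev_A * dev_B) / np.sqrt(np.sum(dev_A ** 2) * np.sum(dev_B ** 2))
)

# Same quantity via NumPy's correlation matrix (off-diagonal entry)
pearson_numpy = float(np.corrcoef(movie_A, movie_B)[0, 1])

print(f"Euclidean distance:  {distance:.4f}")
print(f"Pearson correlation: {pearson_manual:.4f}")
```

Unlike cosine similarity, Pearson correlation subtracts each vector's mean first, so it is unaffected by a constant offset added to all features of one movie.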
User Profile Construction
Simple Average:
profile = (1/n) ∑_{i=1}^n item_features_i
Weighted Average (by ratings):
profile = ∑_{i=1}^n (rating_i × item_features_i) / ∑_{i=1}^n rating_i
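The two profile formulas differ only in how much each movie counts. The sketch below computes both for three movies described by (sci-fi, action, thriller) scores. The ratings 5, 4, and 3 are hypothetical values chosen to make the difference visible.

```python
import numpy as np

# Feature vectors (sci-fi, action, thriller) for three rated movies
item_features = np.array([
    [0.8, 0.7, 0.9],
    [0.9, 0.8, 0.6],
    [1.0, 0.3, 0.4],
])
ratings = np.array([5.0, 4.0, 3.0])  # hypothetical ratings for illustration

# Simple average: every movie counts equally
simple_profile = item_features.mean(axis=0)

# Weighted average: sum(rating_i * features_i) / sum(rating_i)
weighted_profile = (ratings @ item_features) / ratings.sum()

print("Simple profile:  ", np.round(simple_profile, 3))
print("Weighted profile:", np.round(weighted_profile, 3))
```

The weighted profile leans toward the highest-rated movie, so here its thriller score rises and its sci-fi score drops slightly compared with the simple average.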
Preference Learning:
More sophisticated: learn a weight vector w that predicts ratings:
predicted_rating(item) = w · item_features
Use linear regression or other ML algorithms to learn w.
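As a minimal sketch of preference learning, the snippet below fits w by ordinary least squares with `np.linalg.lstsq`. The feature matrix and the "true" weights are synthetic: the ratings are generated exactly from w_true so that the recovered vector can be checked. Real ratings are noisy, and a bias term or regularization would usually be added.

```python
import numpy as np

# Synthetic item features: columns could stand for (sci-fi score, romance score)
item_features = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.2, 0.9],
    [0.0, 1.0],
])

# Hypothetical "true" taste used only to generate consistent ratings
w_true = np.array([4.0, 1.0])
ratings = item_features @ w_true  # [4.0, 3.4, 1.7, 1.0]

# Least-squares fit of ratings ~ item_features @ w
w_learned, residuals, rank, singular_values = np.linalg.lstsq(
    item_features, ratings, rcond=None
)

# Predict the rating of a new, unseen item
new_item = np.array([0.9, 0.1])
predicted = float(new_item @ w_learned)

print("Learned weights:", np.round(w_learned, 3))
print(f"Predicted rating for new item: {predicted:.2f}")
```

Unlike the averaged profile, a learned w can assign negative weight to a feature, which lets the model express active dislike rather than mere indifference.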
Content-Based Filtering - Full Implementation with TF-IDF and Cosine Similarity
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')
class ContentBasedRecommender:
    """
    Content-Based Filtering Recommender using TF-IDF and Cosine Similarity.

    Recommends items similar to what the user has liked in the past.
    """

    def __init__(self):
        """Initialize the recommender."""
        self.tfidf = None
        self.tfidf_matrix = None
        self.movie_indices = None
        self.movies_df = None
        self.user_profiles = {}

    def _create_soup(self, row):
        """
        Create a 'soup' of features for each movie.

        Combines all relevant text features into one string.
        """
        features = []
        # Add genres (with higher weight - repeat 3 times)
        if pd.notna(row['genres']):
            features.extend([row['genres']] * 3)
        # Add director (with higher weight - repeat 2 times)
        if pd.notna(row['director']):
            features.extend([row['director']] * 2)
        # Add keywords
        if pd.notna(row['keywords']):
            features.append(row['keywords'])
        # Add cast
        if pd.notna(row['cast']):
            features.append(row['cast'])
        # Add overview/description
        if pd.notna(row['overview']):
            features.append(row['overview'])
        return ' '.join(features)

    def fit(self, movies_df: pd.DataFrame):
        """
        Build the content-based model.

        Parameters:
        -----------
        movies_df : pd.DataFrame
            DataFrame with movie information and features
        """
        # Reset the index so row positions match TF-IDF matrix rows
        self.movies_df = movies_df.reset_index(drop=True)
        # Create the feature soup for each movie
        self.movies_df['soup'] = self.movies_df.apply(self._create_soup, axis=1)
        # Create TF-IDF matrix
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            max_features=5000,  # Limit to top 5000 features
            ngram_range=(1, 2)  # Use unigrams and bigrams
        )
        self.tfidf_matrix = self.tfidf.fit_transform(self.movies_df['soup'])
        # Create index mapping (title -> row position); keep the first row
        # per title so each lookup returns a single index
        titles = pd.Series(self.movies_df.index, index=self.movies_df['title'])
        self.movie_indices = titles[~titles.index.duplicated(keep='first')]
        print(f"Built content model for {len(self.movies_df)} movies")
        print(f"TF-IDF matrix shape: {self.tfidf_matrix.shape}")

    def get_similar_movies(self, movie_title: str, n: int = 10) -> List[Tuple[str, float]]:
        """
        Get movies similar to the given movie.

        Parameters:
        -----------
        movie_title : str
            Title of the movie
        n : int
            Number of similar movies to return

        Returns:
        --------
        similar_movies : List[Tuple[str, float]]
            List of (movie_title, similarity_score) tuples
        """
        # Get movie index
        if movie_title not in self.movie_indices:
            return []
        idx = self.movie_indices[movie_title]
        # Compute cosine similarity with all other movies
        movie_vector = self.tfidf_matrix[idx]
        similarities = cosine_similarity(movie_vector, self.tfidf_matrix).flatten()
        # Get top N similar movies, explicitly skipping the movie itself
        ranked = similarities.argsort()[::-1]
        similar_indices = [i for i in ranked if i != idx][:n]
        # Return movie titles and similarity scores
        similar_movies = [
            (self.movies_df.iloc[i]['title'], similarities[i])
            for i in similar_indices
        ]
        return similar_movies

    def build_user_profile(self, user_id: str, rated_movies: Dict[str, float]):
        """
        Build a user profile from their rating history.

        Parameters:
        -----------
        user_id : str
            User identifier
        rated_movies : Dict[str, float]
            Dictionary mapping movie_title -> rating
        """
        # Get TF-IDF vectors for rated movies
        movie_vectors = []
        weights = []
        for movie_title, rating in rated_movies.items():
            if movie_title in self.movie_indices:
                idx = self.movie_indices[movie_title]
                movie_vectors.append(self.tfidf_matrix[idx].toarray().flatten())
                weights.append(rating)
        if not movie_vectors:
            return
        # Create weighted average profile
        movie_vectors = np.array(movie_vectors)
        weights = np.array(weights).reshape(-1, 1)
        # Normalize weights
        weights = weights / weights.sum()
        # Weighted average
        user_profile = (movie_vectors.T @ weights).flatten()
        self.user_profiles[user_id] = user_profile

    def recommend_for_user(self, user_id: str, rated_movies: Dict[str, float],
                           n: int = 10) -> List[Tuple[str, float]]:
        """
        Get personalized recommendations for a user.

        Parameters:
        -----------
        user_id : str
            User identifier
        rated_movies : Dict[str, float]
            Movies the user has already rated
        n : int
            Number of recommendations to return

        Returns:
        --------
        recommendations : List[Tuple[str, float]]
            List of (movie_title, similarity_score) tuples
        """
        # Build or get user profile
        if user_id not in self.user_profiles:
            self.build_user_profile(user_id, rated_movies)
        # No profile could be built (none of the rated titles are known)
        if user_id not in self.user_profiles:
            return []
        user_profile = self.user_profiles[user_id]
        # Compute similarity with all movies
        similarities = cosine_similarity(
            user_profile.reshape(1, -1),
            self.tfidf_matrix
        ).flatten()
        # Get top N movies (excluding already rated)
        rated_titles = set(rated_movies.keys())
        recommendations = []
        for idx in similarities.argsort()[::-1]:
            title = self.movies_df.iloc[idx]['title']
            if title not in rated_titles:
                recommendations.append((title, similarities[idx]))
            if len(recommendations) >= n:
                break
        return recommendations

    def explain_recommendation(self, movie_title: str, user_rated_movies: List[str]):
        """
        Explain why a movie was recommended based on user's history.

        Parameters:
        -----------
        movie_title : str
            Recommended movie title
        user_rated_movies : List[str]
            Movies the user has rated highly
        """
        if movie_title not in self.movie_indices:
            print(f"Movie '{movie_title}' not found")
            return
        print(f"\nWhy we recommended '{movie_title}':")
        print("=" * 70)
        # Show movie features
        idx = self.movie_indices[movie_title]
        movie_data = self.movies_df.iloc[idx]
        print(f"\nGenres: {movie_data.get('genres', 'N/A')}")
        print(f"Director: {movie_data.get('director', 'N/A')}")
        print(f"Keywords: {movie_data.get('keywords', 'N/A')}")
        # Compare with user's rated movies
        print(f"\nBecause you liked:")
        for rated_movie in user_rated_movies[:3]:
            similar_movies = self.get_similar_movies(rated_movie, n=20)
            for rec_title, score in similar_movies:
                if rec_title == movie_title:
                    print(f"  - '{rated_movie}' (similarity: {score:.3f})")
                    break

# Create sample movie dataset
def create_movie_dataset():
    """
    Create a sample movie dataset with features.

    Returns:
    --------
    movies_df : pd.DataFrame
        DataFrame with movie information
    """
    movies = {
        'title': [
            'Toy Story',
            'Inception',
            'The Matrix',
            'Shrek',
            'The Notebook',
            'Star Wars: Episode IV',
            'Frozen',
            'The Dark Knight',
            'Titanic',
            'Avengers',
            'Interstellar',
            'The Shawshank Redemption',
            'Pulp Fiction',
            'Forrest Gump',
            'The Godfather',
            'Finding Nemo',
            'The Lion King',
            'Jurassic Park',
            'Harry Potter and the Sorcerer\'s Stone',
            'The Lord of the Rings: The Fellowship of the Ring'
        ],
        'genres': [
            'Animation Comedy Family',
            'Action Sci-Fi Thriller',
            'Action Sci-Fi',
            'Animation Comedy Family',
            'Romance Drama',
            'Action Adventure Sci-Fi',
            'Animation Family Musical',
            'Action Crime Thriller',
            'Romance Drama',
            'Action Adventure Sci-Fi',
            'Adventure Drama Sci-Fi',
            'Drama Crime',
            'Crime Drama Thriller',
            'Drama Romance Comedy',
            'Crime Drama',
            'Animation Family Adventure',
            'Animation Family Drama',
            'Action Adventure Sci-Fi',
            'Fantasy Adventure Family',
            'Adventure Fantasy Action'
        ],
        'director': [
            'John Lasseter',
            'Christopher Nolan',
            'The Wachowskis',
            'Andrew Adamson',
            'Nick Cassavetes',
            'George Lucas',
            'Chris Buck',
            'Christopher Nolan',
            'James Cameron',
            'Joss Whedon',
            'Christopher Nolan',
            'Frank Darabont',
            'Quentin Tarantino',
            'Robert Zemeckis',
            'Francis Ford Coppola',
            'Andrew Stanton',
            'Roger Allers',
            'Steven Spielberg',
            'Chris Columbus',
            'Peter Jackson'
        ],
        'keywords': [
            'toys friendship adventure',
            'dreams reality heist mind-bending',
            'virtual reality artificial intelligence rebellion',
            'ogre fairy tale friendship',
            'love memory alzheimer',
            'space rebels empire force',
            'ice magic sisters love',
            'batman joker chaos justice',
            'ship disaster love class-divide',
            'superheroes team aliens invasion',
            'space wormhole time father-daughter',
            'prison hope friendship redemption',
            'crime violence non-linear-narrative',
            'life journey simple-man',
            'mafia family power',
            'ocean fish father-son',
            'lion africa coming-of-age',
            'dinosaurs theme-park science',
            'magic wizard school friendship',
            'quest ring fellowship evil'
        ],
        'cast': [
            'Tom Hanks Tim Allen',
            'Leonardo DiCaprio Joseph Gordon-Levitt',
            'Keanu Reeves Laurence Fishburne',
            'Mike Myers Eddie Murphy',
            'Ryan Gosling Rachel McAdams',
            'Mark Hamill Harrison Ford',
            'Kristen Bell Idina Menzel',
            'Christian Bale Heath Ledger',
            'Leonardo DiCaprio Kate Winslet',
            'Robert Downey Jr Chris Evans',
            'Matthew McConaughey Anne Hathaway',
            'Tim Robbins Morgan Freeman',
            'John Travolta Samuel L. Jackson',
            'Tom Hanks Robin Wright',
            'Marlon Brando Al Pacino',
            'Albert Brooks Ellen DeGeneres',
            'Matthew Broderick James Earl Jones',
            'Sam Neill Laura Dern',
            'Daniel Radcliffe Emma Watson',
            'Elijah Wood Ian McKellen'
        ],
        'overview': [
            'A cowboy doll is profoundly threatened when a new spaceman figure supplants him as top toy.',
            'A thief who enters the dreams of others to steal secrets from their subconscious.',
            'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
            'An ogre rescues a princess from a tower and they fall in love despite their differences.',
            'A poor yet passionate young man falls in love with a rich young woman in the 1940s.',
            'Luke Skywalker joins forces with a Jedi Knight to rescue a princess and save the galaxy.',
            'When the newly crowned Queen Elsa accidentally uses her power to curse her homeland.',
            'Batman must accept one of the greatest psychological tests to fight injustice.',
            'A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious Titanic.',
            'Earth\'s mightiest heroes must come together to stop an alien invasion.',
            'A team of explorers travel through a wormhole in space in an attempt to ensure humanity\'s survival.',
            'Two imprisoned men bond over years finding solace and eventual redemption through acts of common decency.',
            'The lives of two mob hitmen, a boxer, and a pair of diner bandits intertwine in four tales of violence.',
            'The presidencies of Kennedy and Johnson unfold through the perspective of an Alabama man.',
            'The aging patriarch of an organized crime dynasty transfers control to his reluctant son.',
            'After his son is captured in the Great Barrier Reef a timid clownfish sets out on a journey to bring him home.',
            'Lion cub and future king Simba searches for his identity after the murder of his father.',
            'Scientists clone dinosaurs to populate a theme park which suffers a major security breakdown.',
            'An orphaned boy enrolls in a school of wizardry where he learns the truth about himself.',
            'A meek Hobbit embarks on a journey to destroy the One Ring and save Middle-earth.'
        ]
    }
    return pd.DataFrame(movies)

# Create movie dataset
movies_df = create_movie_dataset()
print("Movie Dataset:")
print("=" * 70)
print(movies_df[['title', 'genres', 'director']].to_string())
print("\n")
# Create and fit the content-based recommender
cb_recommender = ContentBasedRecommender()
cb_recommender.fit(movies_df)
print("\n")
# Example 1: Find similar movies
movie_title = "Inception"
print(f"Movies similar to '{movie_title}':")
print("=" * 70)
similar_movies = cb_recommender.get_similar_movies(movie_title, n=5)
for rank, (title, score) in enumerate(similar_movies, 1):
    print(f"{rank}. {title} (similarity: {score:.3f})")
print("\n")
# Example 2: Personalized recommendations
user_ratings = {
    'Inception': 5.0,
    'The Matrix': 5.0,
    'The Dark Knight': 4.5,
    'Interstellar': 5.0
}
print("User Rating History:")
print("=" * 70)
for movie, rating in user_ratings.items():
    print(f"  - {movie}: {rating} stars")
print("\n")
print("Personalized Recommendations:")
print("=" * 70)
recommendations = cb_recommender.recommend_for_user(
    user_id='user_001',
    rated_movies=user_ratings,
    n=5
)
for rank, (title, score) in enumerate(recommendations, 1):
    print(f"{rank}. {title} (match score: {score:.3f})")
print("\n")
# Example 3: Explain a recommendation
if recommendations:
    rec_title = recommendations[0][0]
    cb_recommender.explain_recommendation(
        rec_title,
        list(user_ratings.keys())
    )
Movie Dataset:
======================================================================
title genres director
0 Toy Story Animation Comedy Family John Lasseter
1 Inception Action Sci-Fi Thriller Christopher Nolan
2 The Matrix Action Sci-Fi The Wachowskis
3 Shrek Animation Comedy Family Andrew Adamson
4 The Notebook Romance Drama Nick Cassavetes
5 Star Wars: Episode IV Action Adventure Sci-Fi George Lucas
6 Frozen Animation Family Musical Chris Buck
7 The Dark Knight Action Crime Thriller Christopher Nolan
8 Titanic Romance Drama James Cameron
9 Avengers Action Adventure Sci-Fi Joss Whedon
10 Interstellar Adventure Drama Sci-Fi Christopher Nolan
11 The Shawshank Redemption Drama Crime Frank Darabont
12 Pulp Fiction Crime Drama Thriller Quentin Tarantino
13 Forrest Gump Drama Romance Comedy Robert Zemeckis
14 The Godfather Crime Drama Francis Ford Coppola
15 Finding Nemo Animation Family Adventure Andrew Stanton
16 The Lion King Animation Family Drama Roger Allers
17 Jurassic Park Action Adventure Sci-Fi Steven Spielberg
18 Harry Potter and the Sorcerer's Stone Fantasy Adventure Family Chris Columbus
19 The Lord of the Rings: The Fellowship of the Ring Adventure Fantasy Action Peter Jackson
Built content model for 20 movies
TF-IDF matrix shape: (20, 762)
Movies similar to 'Inception':
======================================================================
1. The Matrix (similarity: 0.305)
2. The Dark Knight (similarity: 0.267)
3. Interstellar (similarity: 0.226)
4. Avengers (similarity: 0.175)
5. Star Wars: Episode IV (similarity: 0.173)
User Rating History:
======================================================================
- Inception: 5.0 stars
- The Matrix: 5.0 stars
- The Dark Knight: 4.5 stars
- Interstellar: 5.0 stars
Personalized Recommendations:
======================================================================
1. Star Wars: Episode IV (match score: 0.252)
2. Avengers (match score: 0.251)
3. Jurassic Park (match score: 0.231)
4. Pulp Fiction (match score: 0.095)
5. The Lord of the Rings: The Fellowship of the Ring (match score: 0.061)
Why we recommended 'Star Wars: Episode IV':
======================================================================
Genres: Action Adventure Sci-Fi
Director: George Lucas
Keywords: space rebels empire force
Because you liked:
- 'Inception' (similarity: 0.173)
- 'The Matrix' (similarity: 0.221)
- 'The Dark Knight' (similarity: 0.040)
Understanding the TF-IDF Implementation
1. Feature Soup Creation:
def _create_soup(self, row):
    # Combine all features with different weights
    features = []
    features.extend([row['genres']] * 3)    # Genres are important!
    features.extend([row['director']] * 2)  # Director matters too
    features.append(row['keywords'])
    return ' '.join(features)
This creates a weighted “bag of words” where important features appear multiple times.
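To see what the repetition does to the term counts, here is a minimal sketch that builds the soup for one movie and counts its tokens. `create_soup` is a standalone copy of the method above, and the `row` dictionary stands in for a DataFrame row, using the Star Wars values printed in the explanation output.

```python
from collections import Counter

def create_soup(row):
    """Standalone version of _create_soup: repeat fields to weight them."""
    features = []
    features.extend([row['genres']] * 3)    # genres counted three times
    features.extend([row['director']] * 2)  # director counted twice
    features.append(row['keywords'])        # keywords counted once
    return ' '.join(features)

# Stand-in for one DataFrame row (values from the Star Wars explanation output)
row = {
    'genres': 'Action Adventure Sci-Fi',
    'director': 'George Lucas',
    'keywords': 'space rebels empire force',
}

soup = create_soup(row)
token_counts = Counter(soup.split())
print(token_counts['Action'], token_counts['George'], token_counts['space'])  # 3 2 1
```

Because TF-IDF starts from raw term counts, a genre token that appears three times carries roughly three times the weight of a keyword that appears once, before the IDF factor is applied.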
2. TF-IDF Transformation:
self.tfidf = TfidfVectorizer(
    stop_words='english',  # Remove common words (the, a, is, etc.)
    max_features=5000,     # Keep at most the 5000 most frequent terms in the corpus
    ngram_range=(1, 2)     # Use single words and word pairs
)
3. Similarity Computation:
similarities = cosine_similarity(movie_vector, self.tfidf_matrix)
This computes the cosine similarity between one movie and every movie in the matrix (including itself) in a single vectorized call.
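Steps 2 and 3 can be traced by hand on a tiny corpus. The sketch below is a dependency-free approximation, not scikit-learn's implementation: it assumes whitespace tokenization, skips stop-word removal and n-grams, and uses the smoothed IDF formula ln((1 + N) / (1 + df)) + 1. The helper names `tfidf_vectors` and `cosine` are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms that appear in fewer documents get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Genre strings taken from the movie dataset
docs = {
    'Inception': 'Action Sci-Fi Thriller',
    'The Matrix': 'Action Sci-Fi',
    'The Notebook': 'Romance Drama',
}
titles = list(docs)
vectors = tfidf_vectors(list(docs.values()))
query = vectors[titles.index('Inception')]
similarities = {t: cosine(query, vec) for t, vec in zip(titles, vectors)}

for title, sim in sorted(similarities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {sim:.3f}")
```

The query movie scores 1.0 against itself, which is why a recommender normally drops the query title from the ranked list. A movie that shares no terms with the query scores exactly 0.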
4. User Profile:
# Weighted average of movie vectors
user_profile = (movie_vectors.T @ weights).flatten()
This creates a “taste profile” that combines all movies the user liked.
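A small worked example makes the weighted average concrete. This is a sketch under stated assumptions: the weights are the user's ratings normalized to sum to 1 (the weighting scheme itself is not shown in this section), the three-dimensional feature vectors are made up, and `build_user_profile` is an illustrative helper written with plain lists instead of NumPy.

```python
def build_user_profile(movie_vectors, ratings):
    """Weighted average of movie vectors, weighted by normalized ratings."""
    total = sum(ratings)
    weights = [r / total for r in ratings]  # assumption: weights sum to 1
    n_features = len(movie_vectors[0])
    # Same result as movie_vectors.T @ weights for NumPy arrays
    return [
        sum(w * vec[j] for w, vec in zip(weights, movie_vectors))
        for j in range(n_features)
    ]

# Hypothetical feature vectors: [action, sci-fi, romance]
movie_vectors = [
    [1.0, 1.0, 0.0],  # rated 5.0
    [1.0, 0.0, 0.0],  # rated 4.0
    [0.0, 0.0, 1.0],  # rated 1.0
]
ratings = [5.0, 4.0, 1.0]

user_profile = build_user_profile(movie_vectors, ratings)
print([round(x, 2) for x in user_profile])
```

The profile leans toward action (0.9) and sci-fi (0.5) and barely registers romance (0.1), because the low-rated romance movie contributes with a small weight.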
Comparison and Hybrid Approaches
When to Use Each Approach
| Scenario | Best Approach |
|---|---|
| Cold Start - New User | Content-Based (ask preferences) |
| Cold Start - New Item | Content-Based (use item features) |
| Serendipity/Discovery | Collaborative (finds unexpected matches) |
| Explainability | Content-Based (can show why) |
| Limited Item Features | Collaborative (learns features) |
| Privacy Concerns | Content-Based (no user data sharing) |
| Scalability | Depends on implementation |
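The cold-start rows of the table can be read as a simple routing rule. The sketch below is one possible reading, not a prescribed policy: `choose_approach` and the `min_ratings` threshold are hypothetical and would need tuning on real data.

```python
def choose_approach(n_user_ratings, n_item_ratings, min_ratings=5):
    """Pick a recommender family from how much interaction data exists.

    min_ratings is a hypothetical threshold, not a recommended value.
    """
    if n_user_ratings < min_ratings or n_item_ratings < min_ratings:
        # Cold start: too few interactions for collaborative filtering
        return 'content-based'
    return 'collaborative'

print(choose_approach(n_user_ratings=0, n_item_ratings=500))    # new user
print(choose_approach(n_user_ratings=120, n_item_ratings=0))    # new item
print(choose_approach(n_user_ratings=120, n_item_ratings=500))  # enough data
```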
Hybrid Approach
Many production recommender systems combine both approaches:
from typing import Dict, List, Tuple
import numpy as np
class HybridRecommender:
    """
    Hybrid recommender combining collaborative and content-based filtering.
    """
    def __init__(self, cf_model, cb_model, cf_weight=0.5):
        """
        Initialize hybrid recommender.
        Parameters:
        -----------
        cf_model : CollaborativeFilteringRecommender
            Trained collaborative filtering model
        cb_model : ContentBasedRecommender
            Trained content-based model
        cf_weight : float
            Weight for collaborative filtering (0-1)
            Content-based weight = 1 - cf_weight
        """
        self.cf_model = cf_model
        self.cb_model = cb_model
        self.cf_weight = cf_weight
        self.cb_weight = 1 - cf_weight
    def recommend(self, user_idx: int, user_id: str,
                  user_ratings: Dict[str, float],
                  Y: np.ndarray, n: int = 10) -> List[Tuple[str, float]]:
        """
        Get hybrid recommendations.
        Combines scores from both models.
        Note: the weighted sum assumes both models return scores on
        comparable scales; rescale them first if they do not.
        """
        # Get collaborative filtering recommendations
        cf_recs = self.cf_model.recommend_for_user(user_idx, Y, n=50)
        # Get content-based recommendations
        cb_recs = self.cb_model.recommend_for_user(user_id, user_ratings, n=50)
        # Combine scores
        combined_scores = {}
        # Add CF scores
        for movie_idx, score in cf_recs:
            # get_movie_title is a helper assumed to map a movie index to its title
            movie_title = get_movie_title(movie_idx)
            combined_scores[movie_title] = self.cf_weight * score
        # Add CB scores
        for movie_title, score in cb_recs:
            if movie_title in combined_scores:
                combined_scores[movie_title] += self.cb_weight * score
            else:
                combined_scores[movie_title] = self.cb_weight * score
        # Sort by combined score
        recommendations = sorted(
            combined_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )[:n]
        return recommendations
Real-World Production Considerations
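One practical detail when blending scores the way the hybrid recommender does: collaborative filtering scores are predicted ratings, while content-based scores are cosine similarities, so a plain weighted sum can be dominated by whichever model has the larger scale. A minimal sketch of one common remedy, min-max rescaling each model's scores to [0, 1] before blending, follows. `min_max_scale` and `blend` are illustrative helpers rather than part of the classes above, and the sample scores are made up.

```python
def min_max_scale(scores):
    """Rescale a {title: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every item as equally strong
        return {title: 1.0 for title in scores}
    return {title: (s - lo) / (hi - lo) for title, s in scores.items()}

def blend(cf_scores, cb_scores, cf_weight=0.5):
    """Weighted sum of rescaled scores; a missing score counts as 0."""
    cf_scaled = min_max_scale(cf_scores)
    cb_scaled = min_max_scale(cb_scores)
    combined = {}
    for title in set(cf_scaled) | set(cb_scaled):
        combined[title] = (cf_weight * cf_scaled.get(title, 0.0)
                           + (1 - cf_weight) * cb_scaled.get(title, 0.0))
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores: CF on a rating scale, CB as cosine similarities
cf_scores = {'Avengers': 4.6, 'Jurassic Park': 4.1, 'Titanic': 3.2}
cb_scores = {'Avengers': 0.20, 'Star Wars: Episode IV': 0.30, 'Jurassic Park': 0.10}

ranked = blend(cf_scores, cb_scores)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

Min-max scaling is sensitive to outliers and makes scores relative to each candidate list; rank-based or z-score normalization are alternatives with different trade-offs.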
1. Scalability
For millions of users and items:
# Index item vectors for k-nearest-neighbor lookup.
# Note: scikit-learn's NearestNeighbors performs exact search; truly
# approximate search requires a dedicated ANN library.
from sklearn.neighbors import NearestNeighbors
class ScalableContentBasedRecommender:
    def __init__(self, n_neighbors=50):
        self.nn_model = NearestNeighbors(
            n_neighbors=n_neighbors,
            metric='cosine',
            # The tree-based algorithms ('ball_tree', 'kd_tree') do not
            # support the cosine metric, so brute force is required here
            algorithm='brute'
        )
    def fit(self, item_features):
        self.nn_model.fit(item_features)
    def get_similar_items(self, item_vector, n=10):
        distances, indices = self.nn_model.kneighbors(
            item_vector.reshape(1, -1),
            n_neighbors=n
        )
        # Cosine distance = 1 - cosine similarity
        return indices[0], 1 - distances[0]
2. Online Learning
Update models as new ratings arrive:
class OnlineCollaborativeFiltering:
    # Assumes self.W, self.X, self.b, self.learning_rate and self.lambda_reg
    # are initialized elsewhere (for example, copied from a trained model)
    def update_user_profile(self, user_idx, new_rating, movie_idx):
        """Update user and movie parameters for one new rating using SGD."""
        # Copy current vectors so both gradients use the pre-update values
        w_old = self.W[user_idx].copy()
        x_old = self.X[movie_idx].copy()
        # Prediction error for this single rating
        prediction = np.dot(w_old, x_old) + self.b[0, user_idx]
        error = prediction - new_rating
        # Update parameters
        self.W[user_idx] -= self.learning_rate * (
            error * x_old + self.lambda_reg * w_old
        )
        self.X[movie_idx] -= self.learning_rate * (
            error * w_old + self.lambda_reg * x_old
        )
        self.b[0, user_idx] -= self.learning_rate * error
3. A/B Testing
Always test recommendations in production:
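import hashlib
class ABTestingRecommender:
    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = traffic_split
    def recommend(self, user_id, **kwargs):
        # Deterministically bucket the user so they always see the same variant.
        # Python's built-in hash() is salted per process for strings, so a
        # stable digest is used instead.
        digest = hashlib.sha256(str(user_id).encode('utf-8')).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < self.traffic_split * 100:
            model = self.model_a
            variant = 'A'
        else:
            model = self.model_b
            variant = 'B'
        recommendations = model.recommend(user_id, **kwargs)
        # Log for analysis (log_recommendations is assumed to be defined elsewhere)
        self.log_recommendations(user_id, variant, recommendations)
        return recommendations
4. Diversity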
Avoid filter bubbles by adding diversity:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def diverse_recommendations(recommendations, item_vectors, diversity_weight=0.3, n=10):
    """
    Re-rank recommendations to increase diversity.
    Uses Maximal Marginal Relevance (MMR).
    item_vectors (an assumed input) maps each title to its feature
    vector with shape (1, n_features).
    """
    selected = []
    candidates = list(recommendations)
    while len(selected) < n and candidates:
        if not selected:
            # Add highest scoring item first (input is assumed sorted by score)
            selected.append(candidates.pop(0))
        else:
            # Balance relevance and diversity
            mmr_scores = []
            for cand_title, cand_score in candidates:
                # Relevance
                relevance = cand_score
                # Diversity: 1 - similarity to the closest already-selected item
                diversity = min(
                    1 - cosine_similarity(
                        item_vectors[cand_title], item_vectors[sel_title]
                    )[0, 0]
                    for sel_title, _ in selected
                )
                # MMR score
                mmr = (1 - diversity_weight) * relevance + diversity_weight * diversity
                mmr_scores.append(mmr)
            # Select item with highest MMR
            best_idx = int(np.argmax(mmr_scores))
            selected.append(candidates.pop(best_idx))
    return selected
Conclusion
Key Takeaways
Collaborative Filtering:
- ✓ Discovers hidden patterns automatically
- ✓ Excellent for personalization
- ✓ Improves with more data
- ✗ Cold start problem
- ✗ Requires sufficient user-item interactions
Content-Based Filtering:
- ✓ No cold start for items
- ✓ Explainable recommendations
- ✓ Works with limited data
- ✗ Limited serendipity
- ✗ Requires feature engineering
Best Practice: Use a hybrid approach that combines the strengths of both methods!
Further Reading
Academic Papers:
1. Matrix Factorization Techniques for Recommender Systems (Koren et al., 2009)
2. Amazon.com Recommendations: Item-to-Item Collaborative Filtering (Linden et al., 2003)
3. Content-Based Recommendation Systems (Pazzani & Billsus, 2007)
Frameworks:
- Surprise - Python scikit for recommender systems
- LightFM - Hybrid recommendation algorithm
- TensorFlow Recommenders - Deep learning for recommendations
Datasets:
- MovieLens - Movie ratings
- Amazon Reviews - Product reviews
- Last.fm - Music listening data
Practice Exercises
- Modify the collaborative filtering model:
  - Try different learning rates and regularization parameters
  - Experiment with different numbers of features (5, 10, 20, 50)
  - Add early stopping based on a validation set
- Enhance the content-based model:
  - Add numeric features (year, rating, runtime)
  - Try different similarity metrics (Euclidean, Pearson)
  - Implement user feedback loops
- Build a hybrid system:
  - Implement the hybrid recommender from scratch
  - Test different weighting strategies
  - Add diversity promotion
- Evaluate your models:
  - Implement RMSE, MAE, and precision@k metrics
  - Use cross-validation
  - Compare model performance
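As a starting point for the evaluation exercise, here is a minimal dependency-free sketch of the three metrics. The function names and the sample data are illustrative; precision@k here divides by k, which is one common convention.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user found relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Made-up held-out ratings and predictions
y_true = [5.0, 3.0, 4.0]
y_pred = [4.0, 3.0, 2.0]
print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"MAE:  {mae(y_true, y_pred):.3f}")

# Made-up ranked list and the set of items the user actually liked
recommended = ['Avengers', 'Titanic', 'Jurassic Park', 'Frozen']
relevant = {'Avengers', 'Jurassic Park'}
print(f"precision@3: {precision_at_k(recommended, relevant, 3):.3f}")
```

RMSE penalizes large errors more heavily than MAE, so comparing the two on the same predictions hints at whether a few big misses dominate the error.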
Final Thoughts
Recommender systems are at the heart of modern digital experiences. Whether you’re building the next Netflix, Spotify, or Amazon, understanding both collaborative and content-based filtering is essential.
The techniques covered here form the foundation, but modern production systems often use:
- Deep learning (neural collaborative filtering)
- Context-aware recommendations (time, location, device)
- Multi-armed bandits (exploration vs exploitation)
- Graph-based methods (social networks)
Keep learning, keep experimenting, and most importantly - keep building! 🚀