Building Collaborative and Content-Based Recommendation Systems

Introduction

Imagine you’re scrolling through Netflix, trying to decide what to watch. Netflix shows you personalized recommendations - movies and shows it thinks you’ll love. How does it know? The answer lies in recommender systems, one of the most successful applications of machine learning in our daily lives.

In this guide, we’ll build two types of movie recommender systems from scratch:

1. Collaborative Filtering - “People like you also enjoyed…”
2. Content-Based Filtering - “Because you watched…”

We’ll start with intuitive explanations anyone can understand, then dive deep into the mathematics and implementation details.

The Recommendation Problem

The Challenge

You run a streaming service with:
- 10,000 movies
- 100,000 users
- Most users have rated only ~50 movies

The Big Question: How do you recommend the right movies to each user from the remaining 9,950+ unwatched movies?

The Data

Here’s what a typical ratings matrix looks like:

              User1  User2  User3  User4  User5
Toy Story       5.0    ?      4.0    ?      5.0
Inception       ?      3.5    ?      4.5    ?
Shrek           4.0    5.0    2.0    ?      4.5
The Matrix      ?      ?      5.0    5.0    ?
Frozen          3.0    4.5    ?      2.0    ?

Where:
- Numbers = ratings (1-5 stars)
- ? = movie not watched/rated (about 99.5% of the matrix in the scenario above, where each user has rated ~50 of 10,000 movies)

Goal: Fill in the ? marks with accurate predictions!
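To make the data structure concrete, here is a minimal sketch of the small matrix above as a NumPy array, with `np.nan` standing in for each `?`. The names `ratings`, `rated_mask`, and `sparsity` are illustrative choices, not part of any library.

```python
import numpy as np

# Rows are movies, columns are users; np.nan marks a missing rating ("?").
movies = ["Toy Story", "Inception", "Shrek", "The Matrix", "Frozen"]
users = ["User1", "User2", "User3", "User4", "User5"]
nan = np.nan
ratings = np.array([
    [5.0, nan, 4.0, nan, 5.0],   # Toy Story
    [nan, 3.5, nan, 4.5, nan],   # Inception
    [4.0, 5.0, 2.0, nan, 4.5],   # Shrek
    [nan, nan, 5.0, 5.0, nan],   # The Matrix
    [3.0, 4.5, nan, 2.0, nan],   # Frozen
])

# Indicator mask: True where a rating exists.
rated_mask = ~np.isnan(ratings)

n_known = int(rated_mask.sum())
sparsity = 1.0 - n_known / ratings.size

print(f"Known ratings: {n_known} of {ratings.size}")
print(f"Fraction missing: {sparsity:.0%}")
```

This toy matrix is far denser than a real one; with thousands of movies the missing fraction climbs close to 100%.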

Collaborative Filtering

Intuitive Explanation

The Core Idea

Remember when you asked a friend for a movie recommendation and they said:

“Well, we both loved Inception and Interstellar, so you’ll probably love Tenet too!”

That’s collaborative filtering! It finds users with similar tastes and uses their ratings to make predictions.

A Simple Story

Let’s say we have three friends:

Alice:
- Loved: The Matrix (5⭐), Inception (5⭐)
- Hated: The Notebook (1⭐)

Bob:
- Loved: The Matrix (5⭐), Inception (4.5⭐)
- Hated: The Notebook (2⭐)

Charlie:
- Loved: The Notebook (5⭐), Titanic (5⭐)
- Hated: The Matrix (2⭐)

Now, Alice and Bob have very similar taste (both love sci-fi, hate romance).

If Bob watched “Blade Runner” and gave it 5⭐, we can predict Alice will probably rate it highly too!
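One rough way to put a number on "similar taste" is to compare ratings on the movies two friends have both seen. The sketch below uses the average absolute rating gap, a deliberately crude measure picked for illustration; it is not the method this guide goes on to implement, and the helper name `mean_rating_gap` is made up.

```python
# Ratings from the story above; a missing key means "not rated".
ratings = {
    "Alice":   {"The Matrix": 5.0, "Inception": 5.0, "The Notebook": 1.0},
    "Bob":     {"The Matrix": 5.0, "Inception": 4.5, "The Notebook": 2.0},
    "Charlie": {"The Notebook": 5.0, "Titanic": 5.0, "The Matrix": 2.0},
}

def mean_rating_gap(a: str, b: str) -> float:
    """Average absolute rating difference over movies both users rated."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        raise ValueError(f"{a} and {b} share no rated movies")
    return sum(abs(ratings[a][m] - ratings[b][m]) for m in shared) / len(shared)

gap_bob = mean_rating_gap("Alice", "Bob")          # 3 shared movies
gap_charlie = mean_rating_gap("Alice", "Charlie")  # 2 shared movies

print(f"Alice vs Bob:     {gap_bob:.2f} stars apart on average")
print(f"Alice vs Charlie: {gap_charlie:.2f} stars apart on average")
```

Alice and Bob differ by 0.5 stars on average, Alice and Charlie by 3.5, so Bob's opinion of "Blade Runner" is the better guide to Alice's.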

The Magic Trick

But here’s where collaborative filtering gets really clever:

We don’t need to know that these movies are “sci-fi” or “romance”!

The algorithm discovers hidden patterns automatically:
- Pattern 1: “Sci-fi action lovers”
- Pattern 2: “Romance drama fans”
- Pattern 3: “Comedy enthusiasts”
- And many more subtle patterns…

Technical Deep Dive: Collaborative Filtering

Mathematical Foundation

Collaborative filtering learns two sets of vectors:

For each movie i:
- Feature vector x^(i) ∈ ℝ^n (e.g., n=10 dimensions)

For each user j:
- Parameter vector w^(j) ∈ ℝ^n
- Bias term b^(j) ∈ ℝ

Prediction formula:

ŷ^(i,j) = w^(j) · x^(i) + b^(j)

Where · denotes the dot product (element-wise multiplication and sum).

What Do These Vectors Mean?

Think of x^(i) as the “DNA” of movie i:

x^(Inception) = [0.92, 0.15, 0.78, 0.05, 0.31, ...]
                  ↑     ↑     ↑     ↑     ↑
               feature feature feature feature ...
                 1      2      3      4

The algorithm might learn that:
- Feature 1 = “mind-bending complexity”
- Feature 2 = “romance level”
- Feature 3 = “action intensity”
- etc.

Think of w^(j) as user j’s “taste profile”:

w^(Alice) = [0.89, 0.10, 0.82, 0.08, 0.25, ...]
             ↑     ↑     ↑     ↑     ↑
           likes  dislikes likes dislikes neutral
         complex romance action   slow   about
          plots            films  films  feature 5
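Plugging the two example vectors into the prediction formula looks like this. It is a minimal sketch that uses only the first five entries shown above, and the bias `b_alice = 3.0` is a hypothetical value chosen for illustration.

```python
import numpy as np

# First five entries shown above for Inception and for Alice.
x_inception = np.array([0.92, 0.15, 0.78, 0.05, 0.31])
w_alice = np.array([0.89, 0.10, 0.82, 0.08, 0.25])
b_alice = 3.0  # hypothetical bias, not taken from the text

# y_hat = w . x + b
match_score = float(np.dot(w_alice, x_inception))
predicted = match_score + b_alice

print(f"Taste/movie match (w . x): {match_score:.4f}")
print(f"Predicted rating:          {predicted:.2f}")
```

The dot product is large when the movie is strong on the same features the user weights heavily; the bias shifts the result to reflect how generously that user rates overall.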

The Cost Function

We want to minimize the squared prediction error across all known ratings:

J(X, W, b) = 1/2 ∑_{(i,j):r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j))² 
             + λ/2 ∑_{j=1}^{n_u} ∑_{k=1}^{n} (w_k^(j))² 
             + λ/2 ∑_{i=1}^{n_m} ∑_{k=1}^{n} (x_k^(i))²

Where:
- r(i,j) = 1 if user j rated movie i, 0 otherwise
- y^(i,j) = actual rating
- First term = prediction error
- Second & third terms = regularization (prevents overfitting)
- λ = regularization parameter
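Before vectorizing anything, it can help to evaluate the cost term by term exactly as the formula is written. The sketch below does that with explicit loops on a tiny made-up example (2 movies, 2 users, one feature); the function name `cost_loops` and all the numbers are assumptions for illustration.

```python
import numpy as np

def cost_loops(X, W, b, Y, R, lam):
    """Evaluate J(X, W, b) term by term, mirroring the formula above."""
    n_movies, n_users = Y.shape
    error_term = 0.0
    for i in range(n_movies):
        for j in range(n_users):
            if R[i, j] == 1:  # only pairs with a known rating contribute
                prediction = W[j] @ X[i] + b[j]
                error_term += (prediction - Y[i, j]) ** 2
    reg_w = 0.5 * lam * np.sum(W ** 2)
    reg_x = 0.5 * lam * np.sum(X ** 2)
    return 0.5 * error_term + reg_w + reg_x

# Tiny hypothetical example: 2 movies, 2 users, n = 1 feature.
X = np.array([[1.0], [2.0]])   # movie features
W = np.array([[1.0], [0.5]])   # user parameters
b = np.array([0.0, 1.0])       # user biases
Y = np.array([[2.0, 0.0],
              [3.0, 2.0]])     # ratings (0 where unrated)
R = np.array([[1, 0],
              [1, 1]])         # 1 where a rating exists

J = cost_loops(X, W, b, Y, R, lam=1.0)
print(f"J = {J}")  # 1.0 (error) + 0.625 (W penalty) + 2.5 (X penalty) = 4.125
```

A loop version like this is slow on real data, but it makes a handy reference for checking a vectorized implementation.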

Why Regularization?

Without regularization, the model could “cheat”:

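# Bad solution (unregularized: huge, unbalanced values):
x = [920.0, 150.0, ...]
w = [0.00089, 0.0001, ...]
prediction = 920.0*0.00089 + 150.0*0.0001  # ≈ 0.834

# Good solution (small, balanced values):
x = [0.92, 0.15, ...]
w = [0.89, 0.10, ...]
prediction = 0.92*0.89 + 0.15*0.10  # ≈ 0.834

Both produce the same prediction here, so the error term alone cannot tell them apart. The regularization terms penalize the large values in the first solution and favor the second, which tends to generalize better to new data!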
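The scale ambiguity behind this example is easy to check numerically: multiplying a movie vector by a constant and dividing the user vector by the same constant leaves the prediction unchanged, while the L2 penalty grows sharply. A minimal sketch, where the factor `c = 1000` and the helper `l2_penalty` are arbitrary choices for illustration:

```python
import numpy as np

x = np.array([0.92, 0.15])   # movie features
w = np.array([0.89, 0.10])   # user parameters
c = 1000.0                   # arbitrary rescaling factor

x_big, w_small = x * c, w / c

pred_balanced = float(w @ x)
pred_scaled = float(w_small @ x_big)

def l2_penalty(w_vec, x_vec, lam=1.0):
    """The two regularization terms of the cost for a single (w, x) pair."""
    return 0.5 * lam * (np.sum(w_vec ** 2) + np.sum(x_vec ** 2))

penalty_balanced = l2_penalty(w, x)
penalty_scaled = l2_penalty(w_small, x_big)

print(f"Predictions: {pred_balanced:.4f} vs {pred_scaled:.4f}")
print(f"Penalties:   {penalty_balanced:.4f} vs {penalty_scaled:.1f}")
```

Because the penalty rises with the size of the values, minimizing the full cost steers the model away from the rescaled solution.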

Optimization with Gradient Descent

We use gradient descent to find optimal X, W, and b:

Repeat until convergence:
    1. Compute gradients: ∂J/∂X, ∂J/∂W, ∂J/∂b
    2. Update parameters:
       X := X - α * ∂J/∂X
       W := W - α * ∂J/∂W
       b := b - α * ∂J/∂b

Where α is the learning rate.

The partial derivatives are:

∂J/∂x_k^(i) = ∑_{j:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j)) * w_k^(j) + λ*x_k^(i)

∂J/∂w_k^(j) = ∑_{i:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j)) * x_k^(i) + λ*w_k^(j)

∂J/∂b^(j) = ∑_{i:r(i,j)=1} (w^(j)·x^(i) + b^(j) - y^(i,j))
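A common sanity check for derivatives like these is to compare them against finite differences of the cost. The sketch below does that on small random data; the data, the helper names (`cost`, `analytic_gradients`, `numeric`), and the step size `eps` are all assumptions made for illustration.

```python
import numpy as np

def cost(X, W, b, Y, R, lam):
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))

def analytic_gradients(X, W, b, Y, R, lam):
    err = (X @ W.T + b - Y) * R    # zero where no rating exists
    grad_X = err @ W + lam * X     # dJ/dX, shape (n_movies, n)
    grad_W = err.T @ X + lam * W   # dJ/dW, shape (n_users, n)
    grad_b = err.sum(axis=0)       # dJ/db, shape (n_users,)
    return grad_X, grad_W, grad_b

rng = np.random.default_rng(0)
n_movies, n_users, n = 4, 3, 2
X = rng.normal(size=(n_movies, n))
W = rng.normal(size=(n_users, n))
b = rng.normal(size=n_users)
Y = rng.integers(1, 6, size=(n_movies, n_users)).astype(float)
R = (rng.random((n_movies, n_users)) < 0.7).astype(float)
lam, eps = 0.5, 1e-6

grad_X, grad_W, grad_b = analytic_gradients(X, W, b, Y, R, lam)

def numeric(param, idx):
    """Central finite difference of the cost w.r.t. one parameter entry."""
    old = param[idx]
    param[idx] = old + eps
    up = cost(X, W, b, Y, R, lam)
    param[idx] = old - eps
    down = cost(X, W, b, Y, R, lam)
    param[idx] = old  # restore the original value
    return (up - down) / (2 * eps)

max_diff = max(
    abs(numeric(X, (1, 0)) - grad_X[1, 0]),
    abs(numeric(W, (2, 1)) - grad_W[2, 1]),
    abs(numeric(b, (0,)) - grad_b[0]),
)
print(f"Largest gap between analytic and numeric gradient: {max_diff:.2e}")
```

If the formulas and the code agree, the gap should be tiny compared with the gradients themselves; a large gap usually points to a sign error or a missing regularization term.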

Collaborative Filtering: Full Implementation from Scratch

import numpy as np
import pandas as pd
from typing import Tuple, List
import matplotlib.pyplot as plt

class CollaborativeFilteringRecommender:
    """
    Matrix Factorization based Collaborative Filtering Recommender
    
    Learns latent features for movies and users to predict ratings.
    """
    
    def __init__(self, n_features: int = 10, learning_rate: float = 0.01, 
                 lambda_reg: float = 0.1, n_iterations: int = 1000):
        """
        Initialize the recommender system.
        
        Parameters:
        -----------
        n_features : int
            Number of latent features to learn
        learning_rate : float
            Step size for gradient descent
        lambda_reg : float
            Regularization parameter
        n_iterations : int
            Number of training iterations
        """
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.lambda_reg = lambda_reg
        self.n_iterations = n_iterations
        
        # These will be learned
        self.X = None  # Movie features
        self.W = None  # User parameters
        self.b = None  # User biases
        
        # Store training history
        self.cost_history = []
        
    def _initialize_parameters(self, n_movies: int, n_users: int):
        """
        Initialize X, W, and b with small random values.
        
        Using small random values breaks symmetry and helps convergence.
        """
        np.random.seed(42)
        self.X = np.random.randn(n_movies, self.n_features) * 0.01
        self.W = np.random.randn(n_users, self.n_features) * 0.01
        self.b = np.zeros((1, n_users))
        
    def _compute_cost(self, Y: np.ndarray, R: np.ndarray) -> float:
        """
        Compute the collaborative filtering cost function.
        
        J = 1/2 * sum of squared errors + regularization
        
        Parameters:
        -----------
        Y : np.ndarray, shape (n_movies, n_users)
            Ratings matrix
        R : np.ndarray, shape (n_movies, n_users)
            Indicator matrix (1 if rated, 0 otherwise)
            
        Returns:
        --------
        cost : float
            The cost function value
        """
        n_movies, n_users = Y.shape
        
        # Predictions for all movie-user pairs
        predictions = np.dot(self.X, self.W.T) + self.b
        
        # Squared error (only for rated movies)
        errors = (predictions - Y) * R
        squared_error = 0.5 * np.sum(errors ** 2)
        
        # Regularization terms
        reg_X = (self.lambda_reg / 2) * np.sum(self.X ** 2)
        reg_W = (self.lambda_reg / 2) * np.sum(self.W ** 2)
        
        # Total cost
        cost = squared_error + reg_X + reg_W
        
        return cost
    
    def _compute_gradients(self, Y: np.ndarray, R: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Compute gradients of the cost function.
        
        Returns:
        --------
        grad_X : np.ndarray
            Gradient with respect to X
        grad_W : np.ndarray
            Gradient with respect to W
        grad_b : np.ndarray
            Gradient with respect to b
        """
        # Predictions
        predictions = np.dot(self.X, self.W.T) + self.b
        
        # Error term
        error = (predictions - Y) * R
        
        # Gradients
        grad_X = np.dot(error, self.W) + self.lambda_reg * self.X
        grad_W = np.dot(error.T, self.X) + self.lambda_reg * self.W
        grad_b = np.sum(error, axis=0, keepdims=True)
        
        return grad_X, grad_W, grad_b
    
    def fit(self, Y: np.ndarray, R: np.ndarray, verbose: bool = True):
        """
        Train the collaborative filtering model.
        
        Parameters:
        -----------
        Y : np.ndarray, shape (n_movies, n_users)
            Ratings matrix
        R : np.ndarray, shape (n_movies, n_users)
            Indicator matrix (1 if rated, 0 otherwise)
        verbose : bool
            Whether to print progress
        """
        n_movies, n_users = Y.shape
        
        # Initialize parameters and clear the cost history from any previous fit
        self._initialize_parameters(n_movies, n_users)
        self.cost_history = []
        
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Compute cost
            cost = self._compute_cost(Y, R)
            self.cost_history.append(cost)
            
            # Compute gradients
            grad_X, grad_W, grad_b = self._compute_gradients(Y, R)
            
            # Update parameters
            self.X -= self.learning_rate * grad_X
            self.W -= self.learning_rate * grad_W
            self.b -= self.learning_rate * grad_b
            
            # Print progress
            if verbose and (iteration % 100 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d}: Cost = {cost:.4f}")
    
    def predict(self, movie_idx: int = None, user_idx: int = None) -> np.ndarray:
        """
        Make predictions.
        
        Parameters:
        -----------
        movie_idx : int, optional
            Specific movie index to predict for
        user_idx : int, optional
            Specific user index to predict for
            
        Returns:
        --------
        predictions : np.ndarray
            Predicted ratings
        """
        # Compute all predictions
        all_predictions = np.dot(self.X, self.W.T) + self.b
        
        if movie_idx is not None and user_idx is not None:
            return all_predictions[movie_idx, user_idx]
        elif movie_idx is not None:
            return all_predictions[movie_idx, :]
        elif user_idx is not None:
            return all_predictions[:, user_idx]
        else:
            return all_predictions
    
    def recommend_for_user(self, user_idx: int, Y: np.ndarray, 
                          n_recommendations: int = 10) -> List[Tuple[int, float]]:
        """
        Get top N movie recommendations for a user.
        
        Parameters:
        -----------
        user_idx : int
            User index
        Y : np.ndarray
            Original ratings matrix (to exclude already rated movies)
        n_recommendations : int
            Number of recommendations to return
            
        Returns:
        --------
        recommendations : List[Tuple[int, float]]
            List of (movie_idx, predicted_rating) tuples
        """
        # Get predictions for this user
        predictions = self.predict(user_idx=user_idx)
        
        # Find movies the user hasn't rated
        unrated_mask = (Y[:, user_idx] == 0)
        unrated_predictions = predictions.copy()
        unrated_predictions[~unrated_mask] = -np.inf
        
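        # Get top N among unrated movies only, so a rated movie is never
        # returned when fewer than N unrated movies remain
        ranked = np.argsort(unrated_predictions)[::-1]
        top_indices = [idx for idx in ranked if unrated_mask[idx]][:n_recommendations]
        recommendations = [(int(idx), float(predictions[idx])) for idx in top_indices]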
        
        return recommendations
    
    def plot_cost_history(self):
        """Plot the training cost history."""
        plt.figure(figsize=(10, 6))
        plt.plot(self.cost_history)
        plt.xlabel('Iteration')
        plt.ylabel('Cost')
        plt.title('Training Cost History')
        plt.grid(True)
        plt.show()

# Example Usage
# -------------

def create_sample_data():
    """
    Create a sample movie ratings dataset.
    
    Returns:
    --------
    Y : np.ndarray
        Ratings matrix
    R : np.ndarray
        Indicator matrix
    movie_names : List[str]
        List of movie names
    user_names : List[str]
        List of user names
    """
    # Sample movies
    movie_names = [
        "Toy Story (1995)",
        "Inception (2010)",
        "The Matrix (1999)",
        "Shrek (2001)",
        "The Notebook (2004)",
        "Star Wars (1977)",
        "Frozen (2013)",
        "The Dark Knight (2008)",
        "Titanic (1997)",
        "Avengers (2012)"
    ]
    
    user_names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
    
    # Ratings matrix (0 = not rated)
    Y = np.array([
        [5.0, 4.5, 4.0, 5.0, 4.0],  # Toy Story
        [0.0, 3.5, 5.0, 0.0, 4.5],  # Inception
        [0.0, 4.0, 5.0, 0.0, 5.0],  # The Matrix
        [4.0, 5.0, 2.0, 4.5, 3.5],  # Shrek
        [3.0, 0.0, 1.0, 4.5, 2.0],  # The Notebook
        [0.0, 5.0, 5.0, 0.0, 5.0],  # Star Wars
        [3.0, 4.5, 0.0, 5.0, 3.0],  # Frozen
        [0.0, 4.0, 5.0, 0.0, 4.5],  # The Dark Knight
        [2.0, 0.0, 1.0, 5.0, 2.5],  # Titanic
        [0.0, 3.5, 4.5, 0.0, 4.0],  # Avengers
    ])
    
    # Create indicator matrix
    R = (Y != 0).astype(int)
    
    return Y, R, movie_names, user_names

# Create sample data
Y, R, movie_names, user_names = create_sample_data()

print("Original Ratings Matrix:")
print("=" * 70)
df = pd.DataFrame(Y, index=movie_names, columns=user_names)
print(df)
print("\n")

# Create and train the model
model = CollaborativeFilteringRecommender(
    n_features=5,
    learning_rate=0.1,
    lambda_reg=1.0,
    n_iterations=1000
)

print("Training Collaborative Filtering Model...")
print("=" * 70)
model.fit(Y, R, verbose=True)
print("\n")

# Make predictions
predictions = model.predict()

print("Predicted Ratings Matrix:")
print("=" * 70)
df_pred = pd.DataFrame(
    np.round(predictions, 2), 
    index=movie_names, 
    columns=user_names
)
print(df_pred)
print("\n")

# Get recommendations for a specific user
user_idx = 0  # Alice
print(f"Top 5 Recommendations for {user_names[user_idx]}:")
print("=" * 70)
recommendations = model.recommend_for_user(user_idx, Y, n_recommendations=5)

for rank, (movie_idx, rating) in enumerate(recommendations, 1):
    print(f"{rank}. {movie_names[movie_idx]}: {rating:.2f} stars")

Original Ratings Matrix:
======================================================================
                        Alice  Bob  Charlie  Diana  Eve
Toy Story (1995)          5.0  4.5      4.0    5.0  4.0
Inception (2010)          0.0  3.5      5.0    0.0  4.5
The Matrix (1999)         0.0  4.0      5.0    0.0  5.0
Shrek (2001)              4.0  5.0      2.0    4.5  3.5
The Notebook (2004)       3.0  0.0      1.0    4.5  2.0
Star Wars (1977)          0.0  5.0      5.0    0.0  5.0
Frozen (2013)             3.0  4.5      0.0    5.0  3.0
The Dark Knight (2008)    0.0  4.0      5.0    0.0  4.5
Titanic (1997)            2.0  0.0      1.0    5.0  2.5
Avengers (2012)           0.0  3.5      4.5    0.0  4.0


Training Collaborative Filtering Model...
======================================================================
Iteration    0: Cost = 310.8793
Iteration  100: Cost = 8.1080
Iteration  200: Cost = 8.1074
Iteration  300: Cost = 8.1074
Iteration  400: Cost = 8.1073
Iteration  500: Cost = 8.1073
Iteration  600: Cost = 8.1073
Iteration  700: Cost = 8.1072
Iteration  800: Cost = 8.1071
Iteration  900: Cost = 8.1071
Iteration  999: Cost = 8.1070


Predicted Ratings Matrix:
======================================================================
                        Alice   Bob  Charlie  Diana   Eve
Toy Story (1995)         4.39  4.45     4.01   4.87  4.19
Inception (2010)         4.27  3.92     4.75   5.03  4.44
The Matrix (1999)        4.52  4.09     4.86   4.99  4.59
Shrek (2001)             3.73  4.78     2.46   4.74  3.32
The Notebook (2004)      2.92  4.69     1.34   4.73  2.54
Star Wars (1977)         4.76  4.39     4.73   4.91  4.65
Frozen (2013)            3.33  4.36     2.57   4.86  3.20
The Dark Knight (2008)   4.39  4.06     4.69   4.99  4.47
Titanic (1997)           2.63  4.44     1.39   4.80  2.46
Avengers (2012)          4.01  3.96     4.28   5.00  4.15


Top 5 Recommendations for Alice:
======================================================================
1. Star Wars (1977): 4.76 stars
2. The Matrix (1999): 4.52 stars
3. The Dark Knight (2008): 4.39 stars
4. Inception (2010): 4.27 stars
5. Avengers (2012): 4.01 stars

Understanding the Code

Key Components:

1. Initialization (_initialize_parameters)
- Uses small random values (×0.01) to break symmetry
- Sets random seed for reproducibility

2. Cost Function (_compute_cost)
- Vectorized implementation for speed
- Computes predictions: X @ W.T + b
- Only counts error where R == 1
- Adds regularization penalties

3. Gradients (_compute_gradients)
- Uses matrix operations for efficiency
- Computes partial derivatives for all parameters
- Includes regularization terms

4. Training (fit)
- Iteratively updates parameters
- Tracks cost history
- Uses batch gradient descent

5. Prediction (predict)
- Can predict for specific movie/user or all pairs
- Simply computes: X @ W.T + b

6. Recommendations (recommend_for_user)
- Excludes already-rated movies
- Sorts by predicted rating
- Returns top N recommendations

Content-Based Filtering

Intuitive Explanation: Content-Based Filtering

The Core Idea

Content-based filtering is like having a friend who says:

“You loved Inception, which is a sci-fi thriller with a mind-bending plot. Try ‘Shutter Island’ - also a mind-bending thriller!”

Instead of comparing users, we compare items (movies) based on their features.

A Simple Story

Imagine you’re at a video store (remember those?) and you tell the clerk:

You: “I loved The Lord of the Rings!”

Clerk: “Great! That’s fantasy, epic adventure, directed by Peter Jackson. Here are some similar titles…”
- The Hobbit (same director, same genre) ✓
- Game of Thrones (fantasy, epic) ✓
- Star Wars (epic adventure) ✓

The clerk is using the content (genre, director, actors) to recommend.

How It Works

Step 1: Describe each movie with features

Inception:
- Genre: Sci-Fi, Action, Thriller
- Director: Christopher Nolan
- Lead Actor: Leonardo DiCaprio
- Year: 2010
- Rating: PG-13
- Keywords: dreams, reality, heist

Step 2: Convert to numbers (feature vector)

Inception = [
    0.8,  # sci-fi score
    0.7,  # action score
    0.9,  # thriller score
    0.2,  # romance score
    0.1,  # comedy score
]

Step 3: Build user profile from their history

If you watched and loved:
- Inception (sci-fi: 0.8, action: 0.7, thriller: 0.9)
- The Matrix (sci-fi: 0.9, action: 0.8, thriller: 0.6)
- Interstellar (sci-fi: 1.0, action: 0.3, thriller: 0.4)

Your profile (average):

You = [
    0.9,  # loves sci-fi!
    0.6,  # likes action
    0.63, # likes thrillers
    0.1,  # doesn't like romance
    0.05, # doesn't like comedy
]

Step 4: Find similar movies

Calculate similarity between your profile and all unwatched movies!
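Steps 3 and 4 can be sketched in a few lines. The code first checks the averaged sci-fi, action, and thriller scores against the three movies listed above, then ranks some unwatched movies against the five-value profile by cosine similarity. The candidate movies and their feature scores are invented for illustration.

```python
import numpy as np

# Step 3: average the [sci-fi, action, thriller] scores of the liked movies.
liked = {
    "Inception":    [0.8, 0.7, 0.9],
    "The Matrix":   [0.9, 0.8, 0.6],
    "Interstellar": [1.0, 0.3, 0.4],
}
first_three = np.mean(list(liked.values()), axis=0)   # ≈ [0.9, 0.6, 0.63]

# Full profile from the text: [sci-fi, action, thriller, romance, comedy].
profile = np.array([0.9, 0.6, 0.63, 0.1, 0.05])

# Step 4: hypothetical unwatched movies with made-up feature scores.
candidates = {
    "Blade Runner": [0.9, 0.5, 0.7, 0.2, 0.0],
    "The Notebook": [0.0, 0.0, 0.1, 1.0, 0.2],
    "Shrek":        [0.1, 0.3, 0.0, 0.3, 0.9],
}

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {title: cosine(profile, feats) for title, feats in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(f"{title}: {scores[title]:.3f}")
```

With these made-up scores, the sci-fi thriller lands far ahead of the comedy and the romance, which is the behavior the four steps describe.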

Technical Deep Dive: Content-Based Filtering

Feature Extraction

For content-based filtering, we need to extract features from movies. Common approaches:

1. Manual Features:

features = {
    'genre_scifi': 1 if 'Sci-Fi' in genres else 0,
    'genre_action': 1 if 'Action' in genres else 0,
    'director_nolan': 1 if director == 'Christopher Nolan' else 0,
    'year': (year - 1900) / 100,  # Normalized
    'runtime': runtime / 200,      # Normalized
    # ... many more
}
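The fragment above leaves `genres`, `director`, `year`, and `runtime` undefined. One way to make it runnable is to wrap it in a function that takes a movie record; the function name `extract_features` and the dictionary layout below are assumptions for illustration.

```python
from typing import Dict, List

def extract_features(movie: Dict) -> Dict[str, float]:
    """Turn one movie record into the hand-crafted features shown above."""
    genres = movie["genres"]
    return {
        "genre_scifi": 1.0 if "Sci-Fi" in genres else 0.0,
        "genre_action": 1.0 if "Action" in genres else 0.0,
        "director_nolan": 1.0 if movie["director"] == "Christopher Nolan" else 0.0,
        "year": (movie["year"] - 1900) / 100,   # same scaling as above
        "runtime": movie["runtime"] / 200,      # same scaling as above
    }

inception = {
    "genres": ["Sci-Fi", "Action", "Thriller"],
    "director": "Christopher Nolan",
    "year": 2010,
    "runtime": 148,  # minutes
}

features = extract_features(inception)
feature_vector: List[float] = list(features.values())  # insertion order is kept

print(features)
print(feature_vector)
```

Note that `(year - 1900) / 100` exceeds 1 for movies released after 2000, so this scaling keeps values small but does not confine them to [0, 1].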

2. TF-IDF (Term Frequency-Inverse Document Frequency):

For text features (plot summaries, keywords):

TF-IDF(term, document) = TF(term, document) × IDF(term)

TF(term, doc) = (# times term appears in doc) / (# total terms in doc)

IDF(term) = log(N / (# documents containing term))

This gives higher weight to unique, descriptive words.
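The formulas above can be computed by hand on a toy corpus to see the weighting at work. The keyword strings below are made up, and this plain version is a sketch of the textbook definition; scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical keyword "documents", one per movie.
docs = {
    "Inception": "dreams reality heist dreams",
    "The Matrix": "virtual reality rebellion",
    "Titanic": "ship disaster love",
}
tokenized = {title: text.split() for title, text in docs.items()}
n_docs = len(tokenized)

def tf(term: str, tokens: list) -> float:
    """Share of the document's terms that are `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term: str) -> float:
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(n_docs / containing)

def tf_idf(term: str, title: str) -> float:
    return tf(term, tokenized[title]) * idf(term)

dreams_score = tf_idf("dreams", "Inception")    # frequent here, absent elsewhere
reality_score = tf_idf("reality", "Inception")  # also appears in The Matrix

print(f"dreams:  {dreams_score:.4f}")
print(f"reality: {reality_score:.4f}")
```

"dreams" scores higher than "reality" for Inception because it is both more frequent in that document and rarer across the corpus.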

3. Embeddings:

Modern approaches use neural networks to learn dense representations:
- Word2Vec for text
- Image embeddings for posters
- Audio embeddings for soundtracks

Similarity Metrics

1. Cosine Similarity (Most Common)

similarity(A, B) = (A · B) / (||A|| × ||B||)
                 = cos(θ)

Where:
- A · B = dot product
- ||A|| = L2 norm (magnitude) of vector A
- θ = angle between vectors

Properties:
- Range: [-1, 1] (usually [0, 1] for non-negative features)
- 1 = identical direction
- 0 = orthogonal (no similarity)
- -1 = opposite direction

Example:

import numpy as np

movie_A = np.array([0.9, 0.8, 0.1, 0.2])
movie_B = np.array([0.8, 0.7, 0.2, 0.3])

# Cosine similarity
similarity = np.dot(movie_A, movie_B) / (
    np.linalg.norm(movie_A) * np.linalg.norm(movie_B)
)
# similarity ≈ 0.99 (very similar!)

2. Euclidean Distance

distance(A, B) = sqrt(∑(A_i - B_i)²)

Smaller distance = more similar.

3. Pearson Correlation

correlation(A, B) = cov(A, B) / (σ_A × σ_B)

Measures linear relationship between vectors.
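For comparison, the other two metrics can be computed on the same pair of example vectors used for cosine similarity. This is a small sketch using NumPy's built-in `np.linalg.norm` and `np.corrcoef`.

```python
import numpy as np

movie_A = np.array([0.9, 0.8, 0.1, 0.2])
movie_B = np.array([0.8, 0.7, 0.2, 0.3])

# Euclidean distance: length of the difference vector (smaller = more similar).
euclidean = float(np.linalg.norm(movie_A - movie_B))

# Pearson correlation: np.corrcoef returns a 2x2 matrix; the off-diagonal
# entry is the correlation between the two vectors.
pearson = float(np.corrcoef(movie_A, movie_B)[0, 1])

print(f"Euclidean distance:  {euclidean:.3f}")
print(f"Pearson correlation: {pearson:.3f}")
```

Every feature differs by exactly 0.1, so the distance is 0.2, and the correlation is close to 1 because the two vectors rise and fall together.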

User Profile Construction

Simple Average:

profile = (1/n) ∑_{i=1}^n item_features_i

Weighted Average (by ratings):

profile = ∑_{i=1}^n (rating_i × item_features_i) / ∑_{i=1}^n rating_i

Preference Learning:

More sophisticated: learn a weight vector w that predicts ratings:

predicted_rating(item) = w · item_features

Use linear regression or other ML algorithms to learn w.
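The three profile-building options can be compared side by side on a tiny example. The data below is invented (four rated movies described by two features, sci-fi and romance), and ordinary least squares via `np.linalg.lstsq` stands in for "linear regression"; no bias term or regularization is included in this sketch.

```python
import numpy as np

# Hypothetical data: 4 rated movies, features [sci-fi, romance].
item_features = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.0, 1.0],
])
ratings = np.array([5.0, 4.0, 2.0, 1.0])

# Simple average: every rated movie counts equally.
simple_profile = item_features.mean(axis=0)

# Weighted average: higher-rated movies pull the profile toward their features.
weighted_profile = (ratings[:, None] * item_features).sum(axis=0) / ratings.sum()

# Preference learning: least-squares fit of ratings ≈ item_features @ w.
w, *_ = np.linalg.lstsq(item_features, ratings, rcond=None)
predicted = item_features @ w

print("Simple profile:  ", simple_profile)
print("Weighted profile:", weighted_profile)
print("Learned weights: ", np.round(w, 3))
print("Fitted ratings:  ", np.round(predicted, 2))
```

The simple average sits near the middle of the two features, the weighted average leans toward sci-fi, and the learned weight on sci-fi is clearly larger than the weight on romance.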

Content-Based Filtering: Full Implementation with TF-IDF and Cosine Similarity

import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')


class ContentBasedRecommender:
    """
    Content-Based Filtering Recommender using TF-IDF and Cosine Similarity.
    
    Recommends items similar to what the user has liked in the past.
    """
    
    def __init__(self):
        """Initialize the recommender."""
        self.tfidf = None
        self.tfidf_matrix = None
        self.movie_indices = None
        self.movies_df = None
        self.user_profiles = {}
        
    def _create_soup(self, row):
        """
        Create a 'soup' of features for each movie.
        
        Combines all relevant text features into one string.
        """
        features = []
        
        # Add genres (with higher weight - repeat 3 times)
        if pd.notna(row['genres']):
            features.extend([row['genres']] * 3)
        
        # Add director (with higher weight - repeat 2 times)
        if pd.notna(row['director']):
            features.extend([row['director']] * 2)
        
        # Add keywords
        if pd.notna(row['keywords']):
            features.append(row['keywords'])
        
        # Add cast
        if pd.notna(row['cast']):
            features.append(row['cast'])
        
        # Add overview/description
        if pd.notna(row['overview']):
            features.append(row['overview'])
        
        return ' '.join(features)
    
    def fit(self, movies_df: pd.DataFrame):
        """
        Build the content-based model.
        
        Parameters:
        -----------
        movies_df : pd.DataFrame
            DataFrame with movie information and features
        """
        self.movies_df = movies_df.copy()
        
        # Create the feature soup for each movie
        self.movies_df['soup'] = self.movies_df.apply(self._create_soup, axis=1)
        
        # Create TF-IDF matrix
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            max_features=5000,  # Limit to top 5000 features
            ngram_range=(1, 2)  # Use unigrams and bigrams
        )
        
        self.tfidf_matrix = self.tfidf.fit_transform(self.movies_df['soup'])
        
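        # Create title -> row position mapping. Positions (not DataFrame index
        # labels) are used because they line up with the TF-IDF matrix rows.
        self.movie_indices = pd.Series(
            np.arange(len(self.movies_df)),
            index=self.movies_df['title']
        )
        # Keep the first row when a title appears more than once
        self.movie_indices = self.movie_indices[~self.movie_indices.index.duplicated()]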
        
        print(f"Built content model for {len(self.movies_df)} movies")
        print(f"TF-IDF matrix shape: {self.tfidf_matrix.shape}")
        
    def get_similar_movies(self, movie_title: str, n: int = 10) -> List[Tuple[str, float]]:
        """
        Get movies similar to the given movie.
        
        Parameters:
        -----------
        movie_title : str
            Title of the movie
        n : int
            Number of similar movies to return
            
        Returns:
        --------
        similar_movies : List[Tuple[str, float]]
            List of (movie_title, similarity_score) tuples
        """
        # Get movie index
        if movie_title not in self.movie_indices:
            return []
        
        idx = self.movie_indices[movie_title]
        
        # Compute cosine similarity with all other movies
        movie_vector = self.tfidf_matrix[idx]
        similarities = cosine_similarity(movie_vector, self.tfidf_matrix).flatten()
        
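        # Get top N similar movies, excluding the movie itself by index
        # (it is not guaranteed to sort first when another movie ties with it)
        ranked = similarities.argsort()[::-1]
        similar_indices = [i for i in ranked if i != idx][:n]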
        
        # Return movie titles and similarity scores
        similar_movies = [
            (self.movies_df.iloc[i]['title'], similarities[i])
            for i in similar_indices
        ]
        
        return similar_movies
    
    def build_user_profile(self, user_id: str, rated_movies: Dict[str, float]):
        """
        Build a user profile from their rating history.
        
        Parameters:
        -----------
        user_id : str
            User identifier
        rated_movies : Dict[str, float]
            Dictionary mapping movie_title -> rating
        """
        # Get TF-IDF vectors for rated movies
        movie_vectors = []
        weights = []
        
        for movie_title, rating in rated_movies.items():
            if movie_title in self.movie_indices:
                idx = self.movie_indices[movie_title]
                movie_vectors.append(self.tfidf_matrix[idx].toarray().flatten())
                weights.append(rating)
        
        if not movie_vectors:
            return
        
        # Create weighted average profile
        movie_vectors = np.array(movie_vectors)
        weights = np.array(weights).reshape(-1, 1)
        
        # Normalize weights
        weights = weights / weights.sum()
        
        # Weighted average
        user_profile = (movie_vectors.T @ weights).flatten()
        
        self.user_profiles[user_id] = user_profile
        
    def recommend_for_user(self, user_id: str, rated_movies: Dict[str, float], 
                          n: int = 10) -> List[Tuple[str, float]]:
        """
        Get personalized recommendations for a user.
        
        Parameters:
        -----------
        user_id : str
            User identifier
        rated_movies : Dict[str, float]
            Movies the user has already rated
        n : int
            Number of recommendations to return
            
        Returns:
        --------
        recommendations : List[Tuple[str, float]]
            List of (movie_title, similarity_score) tuples
        """
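        # Build or get user profile
        if user_id not in self.user_profiles:
            self.build_user_profile(user_id, rated_movies)
        
        # No profile could be built (none of the rated titles are in the catalog)
        if user_id not in self.user_profiles:
            return []
        
        user_profile = self.user_profiles[user_id]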
        
        # Compute similarity with all movies
        similarities = cosine_similarity(
            user_profile.reshape(1, -1), 
            self.tfidf_matrix
        ).flatten()
        
        # Get top N movies (excluding already rated)
        rated_titles = set(rated_movies.keys())
        
        recommendations = []
        for idx in similarities.argsort()[::-1]:
            title = self.movies_df.iloc[idx]['title']
            if title not in rated_titles:
                recommendations.append((title, similarities[idx]))
                if len(recommendations) >= n:
                    break
        
        return recommendations
    
    def explain_recommendation(self, movie_title: str, user_rated_movies: List[str]):
        """
        Explain why a movie was recommended based on user's history.
        
        Parameters:
        -----------
        movie_title : str
            Recommended movie title
        user_rated_movies : List[str]
            Movies the user has rated highly
        """
        if movie_title not in self.movie_indices:
            print(f"Movie '{movie_title}' not found")
            return
        
        print(f"\nWhy we recommended '{movie_title}':")
        print("=" * 70)
        
        # Show movie features
        idx = self.movie_indices[movie_title]
        movie_data = self.movies_df.iloc[idx]
        
        print(f"\nGenres: {movie_data.get('genres', 'N/A')}")
        print(f"Director: {movie_data.get('director', 'N/A')}")
        print(f"Keywords: {movie_data.get('keywords', 'N/A')}")
        
        # Compare with user's rated movies
        print(f"\nBecause you liked:")
        for rated_movie in user_rated_movies[:3]:
            similar_movies = self.get_similar_movies(rated_movie, n=20)
            for rec_title, score in similar_movies:
                if rec_title == movie_title:
                    print(f"  - '{rated_movie}' (similarity: {score:.3f})")
                    break

# Create sample movie dataset
def create_movie_dataset():
    """
    Create a sample movie dataset with features.
    
    Returns:
    --------
    movies_df : pd.DataFrame
        DataFrame with movie information
    """
    movies = {
        'title': [
            'Toy Story',
            'Inception',
            'The Matrix',
            'Shrek',
            'The Notebook',
            'Star Wars: Episode IV',
            'Frozen',
            'The Dark Knight',
            'Titanic',
            'Avengers',
            'Interstellar',
            'The Shawshank Redemption',
            'Pulp Fiction',
            'Forrest Gump',
            'The Godfather',
            'Finding Nemo',
            'The Lion King',
            'Jurassic Park',
            'Harry Potter and the Sorcerer\'s Stone',
            'The Lord of the Rings: The Fellowship of the Ring'
        ],
        'genres': [
            'Animation Comedy Family',
            'Action Sci-Fi Thriller',
            'Action Sci-Fi',
            'Animation Comedy Family',
            'Romance Drama',
            'Action Adventure Sci-Fi',
            'Animation Family Musical',
            'Action Crime Thriller',
            'Romance Drama',
            'Action Adventure Sci-Fi',
            'Adventure Drama Sci-Fi',
            'Drama Crime',
            'Crime Drama Thriller',
            'Drama Romance Comedy',
            'Crime Drama',
            'Animation Family Adventure',
            'Animation Family Drama',
            'Action Adventure Sci-Fi',
            'Fantasy Adventure Family',
            'Adventure Fantasy Action'
        ],
        'director': [
            'John Lasseter',
            'Christopher Nolan',
            'The Wachowskis',
            'Andrew Adamson',
            'Nick Cassavetes',
            'George Lucas',
            'Chris Buck',
            'Christopher Nolan',
            'James Cameron',
            'Joss Whedon',
            'Christopher Nolan',
            'Frank Darabont',
            'Quentin Tarantino',
            'Robert Zemeckis',
            'Francis Ford Coppola',
            'Andrew Stanton',
            'Roger Allers',
            'Steven Spielberg',
            'Chris Columbus',
            'Peter Jackson'
        ],
        'keywords': [
            'toys friendship adventure',
            'dreams reality heist mind-bending',
            'virtual reality artificial intelligence rebellion',
            'ogre fairy tale friendship',
            'love memory alzheimer',
            'space rebels empire force',
            'ice magic sisters love',
            'batman joker chaos justice',
            'ship disaster love class-divide',
            'superheroes team aliens invasion',
            'space wormhole time father-daughter',
            'prison hope friendship redemption',
            'crime violence non-linear-narrative',
            'life journey simple-man',
            'mafia family power',
            'ocean fish father-son',
            'lion africa coming-of-age',
            'dinosaurs theme-park science',
            'magic wizard school friendship',
            'quest ring fellowship evil'
        ],
        'cast': [
            'Tom Hanks Tim Allen',
            'Leonardo DiCaprio Joseph Gordon-Levitt',
            'Keanu Reeves Laurence Fishburne',
            'Mike Myers Eddie Murphy',
            'Ryan Gosling Rachel McAdams',
            'Mark Hamill Harrison Ford',
            'Kristen Bell Idina Menzel',
            'Christian Bale Heath Ledger',
            'Leonardo DiCaprio Kate Winslet',
            'Robert Downey Jr Chris Evans',
            'Matthew McConaughey Anne Hathaway',
            'Tim Robbins Morgan Freeman',
            'John Travolta Samuel L. Jackson',
            'Tom Hanks Robin Wright',
            'Marlon Brando Al Pacino',
            'Albert Brooks Ellen DeGeneres',
            'Matthew Broderick James Earl Jones',
            'Sam Neill Laura Dern',
            'Daniel Radcliffe Emma Watson',
            'Elijah Wood Ian McKellen'
        ],
        'overview': [
            'A cowboy doll is profoundly threatened when a new spaceman figure supplants him as top toy.',
            'A thief who enters the dreams of others to steal secrets from their subconscious.',
            'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
            'An ogre rescues a princess from a tower and they fall in love despite their differences.',
            'A poor yet passionate young man falls in love with a rich young woman in the 1940s.',
            'Luke Skywalker joins forces with a Jedi Knight to rescue a princess and save the galaxy.',
            'When the newly crowned Queen Elsa accidentally uses her power to curse her homeland.',
            'Batman must accept one of the greatest psychological tests to fight injustice.',
            'A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious Titanic.',
            'Earth\'s mightiest heroes must come together to stop an alien invasion.',
            'A team of explorers travel through a wormhole in space in an attempt to ensure humanity\'s survival.',
            'Two imprisoned men bond over years finding solace and eventual redemption through acts of common decency.',
            'The lives of two mob hitmen, a boxer, and a pair of diner bandits intertwine in four tales of violence.',
            'The presidencies of Kennedy and Johnson unfold through the perspective of an Alabama man.',
            'The aging patriarch of an organized crime dynasty transfers control to his reluctant son.',
            'After his son is captured in the Great Barrier Reef a timid clownfish sets out on a journey to bring him home.',
            'Lion cub and future king Simba searches for his identity after the murder of his father.',
            'Scientists clone dinosaurs to populate a theme park which suffers a major security breakdown.',
            'An orphaned boy enrolls in a school of wizardry where he learns the truth about himself.',
            'A meek Hobbit embarks on a journey to destroy the One Ring and save Middle-earth.'
        ]
    }
    
    return pd.DataFrame(movies)

# Create movie dataset
movies_df = create_movie_dataset()

print("Movie Dataset:")
print("=" * 70)
print(movies_df[['title', 'genres', 'director']].to_string())
print("\n")

# Create and fit the content-based recommender
cb_recommender = ContentBasedRecommender()
cb_recommender.fit(movies_df)
print("\n")

# Example 1: Find similar movies
movie_title = "Inception"
print(f"Movies similar to '{movie_title}':")
print("=" * 70)
similar_movies = cb_recommender.get_similar_movies(movie_title, n=5)

for rank, (title, score) in enumerate(similar_movies, 1):
    print(f"{rank}. {title} (similarity: {score:.3f})")
print("\n")

# Example 2: Personalized recommendations
user_ratings = {
    'Inception': 5.0,
    'The Matrix': 5.0,
    'The Dark Knight': 4.5,
    'Interstellar': 5.0
}

print("User Rating History:")
print("=" * 70)
for movie, rating in user_ratings.items():
    print(f"  - {movie}: {rating} stars")
print("\n")

print("Personalized Recommendations:")
print("=" * 70)
recommendations = cb_recommender.recommend_for_user(
    user_id='user_001',
    rated_movies=user_ratings,
    n=5
)

for rank, (title, score) in enumerate(recommendations, 1):
    print(f"{rank}. {title} (match score: {score:.3f})")
print("\n")

# Example 3: Explain a recommendation
if recommendations:
    rec_title = recommendations[0][0]
    cb_recommender.explain_recommendation(
        rec_title, 
        list(user_ratings.keys())
    )

Movie Dataset:
======================================================================
                                                title                      genres              director
0                                           Toy Story     Animation Comedy Family         John Lasseter
1                                           Inception      Action Sci-Fi Thriller     Christopher Nolan
2                                          The Matrix               Action Sci-Fi        The Wachowskis
3                                               Shrek     Animation Comedy Family        Andrew Adamson
4                                        The Notebook               Romance Drama       Nick Cassavetes
5                               Star Wars: Episode IV     Action Adventure Sci-Fi          George Lucas
6                                              Frozen    Animation Family Musical            Chris Buck
7                                     The Dark Knight       Action Crime Thriller     Christopher Nolan
8                                             Titanic               Romance Drama         James Cameron
9                                            Avengers     Action Adventure Sci-Fi           Joss Whedon
10                                       Interstellar      Adventure Drama Sci-Fi     Christopher Nolan
11                           The Shawshank Redemption                 Drama Crime        Frank Darabont
12                                       Pulp Fiction        Crime Drama Thriller     Quentin Tarantino
13                                       Forrest Gump        Drama Romance Comedy       Robert Zemeckis
14                                      The Godfather                 Crime Drama  Francis Ford Coppola
15                                       Finding Nemo  Animation Family Adventure        Andrew Stanton
16                                      The Lion King      Animation Family Drama          Roger Allers
17                                      Jurassic Park     Action Adventure Sci-Fi      Steven Spielberg
18              Harry Potter and the Sorcerer's Stone    Fantasy Adventure Family        Chris Columbus
19  The Lord of the Rings: The Fellowship of the Ring    Adventure Fantasy Action         Peter Jackson


Built content model for 20 movies
TF-IDF matrix shape: (20, 762)


Movies similar to 'Inception':
======================================================================
1. The Matrix (similarity: 0.305)
2. The Dark Knight (similarity: 0.267)
3. Interstellar (similarity: 0.226)
4. Avengers (similarity: 0.175)
5. Star Wars: Episode IV (similarity: 0.173)


User Rating History:
======================================================================
  - Inception: 5.0 stars
  - The Matrix: 5.0 stars
  - The Dark Knight: 4.5 stars
  - Interstellar: 5.0 stars


Personalized Recommendations:
======================================================================
1. Star Wars: Episode IV (match score: 0.252)
2. Avengers (match score: 0.251)
3. Jurassic Park (match score: 0.231)
4. Pulp Fiction (match score: 0.095)
5. The Lord of the Rings: The Fellowship of the Ring (match score: 0.061)



Why we recommended 'Star Wars: Episode IV':
======================================================================

Genres: Action Adventure Sci-Fi
Director: George Lucas
Keywords: space rebels empire force

Because you liked:
  - 'Inception' (similarity: 0.173)
  - 'The Matrix' (similarity: 0.221)
  - 'The Dark Knight' (similarity: 0.040)

Understanding the TF-IDF Implementation

1. Feature Soup Creation:

def _create_soup(self, row):
    # Combine all features with different weights
    features = []
    features.extend([row['genres']] * 3)  # Genres are important!
    features.extend([row['director']] * 2)  # Director matters too
    features.append(row['keywords'])
    return ' '.join(features)

This creates a weighted “bag of words” where important features appear multiple times.
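
To make the weighting concrete, here is a minimal sketch that applies the same logic to the Inception row from the dataset above. The standalone function `create_soup` and the plain-dict row are assumptions made for illustration; they stand in for the `_create_soup` method and a DataFrame row.

```python
from collections import Counter

def create_soup(row):
    """Standalone version of the soup logic, for illustration only."""
    features = []
    features.extend([row['genres']] * 3)    # genres repeated three times
    features.extend([row['director']] * 2)  # director repeated twice
    features.append(row['keywords'])        # keywords appear once
    return ' '.join(features)

# Values copied from the Inception entry in the movie dataset
inception = {
    'genres': 'Action Sci-Fi Thriller',
    'director': 'Christopher Nolan',
    'keywords': 'dreams reality heist mind-bending',
}

soup = create_soup(inception)
token_counts = Counter(soup.split())  # simple whitespace tokenization

print(soup)
print(token_counts['Action'], token_counts['Nolan'], token_counts['heist'])
```

The counts come out as 3, 2 and 1, so a genre word carries three times the raw term frequency of a keyword. Note that this sketch splits on whitespace only; TfidfVectorizer's default tokenizer also splits on hyphens, so "Sci-Fi" becomes two tokens there.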

2. TF-IDF Transformation:

self.tfidf = TfidfVectorizer(
    stop_words='english',    # Remove common words (the, a, is, etc.)
    max_features=5000,       # Keep at most the 5000 most frequent terms
    ngram_range=(1, 2)       # Use single words and word pairs
)
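
To see roughly what the vectorizer computes, the sketch below builds TF-IDF weights by hand for a toy corpus, using single words and word pairs. It is a simplified illustration, not the library's implementation: stop-word removal and `max_features` are left out, tokenization is plain whitespace splitting, and the weighting uses raw counts, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and L2 normalization. The helper names `ngrams` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams from n_min to n_max words long."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

# Toy corpus: three tiny "soups"
docs = [
    'action sci fi thriller',
    'action sci fi',
    'romance drama',
]

term_counts = [Counter(ngrams(d.split())) for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for counts in term_counts for term in counts)

def tfidf_vector(counts):
    """Raw count times smoothed IDF, then L2-normalized."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(term_counts[0])
for term, weight in sorted(vec.items(), key=lambda p: p[1], reverse=True):
    print(f"{term:12s} {weight:.3f}")
```

In the first document, "thriller" gets a higher weight than "action" because it appears in only one document, which is the behaviour TF-IDF is designed to produce.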

3. Similarity Computation:

similarities = cosine_similarity(movie_vector, self.tfidf_matrix)

This efficiently computes cosine similarity between one movie and all others.
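
As a worked example of the formula behind that call, here is a minimal plain-Python sketch of cosine similarity on three toy vectors. The vocabulary, the movie names and the weights are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

# Toy term-weight vectors over the vocabulary [action, sci-fi, romance]
movies = {
    'Movie A': [1.0, 1.0, 0.0],
    'Movie B': [1.0, 0.5, 0.0],
    'Movie C': [0.0, 0.0, 1.0],
}

query_title = 'Movie A'
ranked = sorted(
    ((title, cosine(movies[query_title], vec))
     for title, vec in movies.items() if title != query_title),
    key=lambda pair: pair[1],
    reverse=True,
)

for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

Movie B shares two features with Movie A and scores about 0.949, while Movie C shares none and scores 0.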

4. User Profile:

# Weighted average of movie vectors
user_profile = (movie_vectors.T @ weights).flatten()

Creates a “taste profile” that combines all movies the user liked.
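
The line above can be traced on a small example. In this sketch the feature vectors and ratings are toy values, and the weights are simply the ratings normalized to sum to 1. That weighting scheme is an assumption for illustration and may differ from the one used in the full implementation.

```python
import numpy as np

# Rows = movies the user rated, columns = features [action, sci-fi, romance]
movie_vectors = np.array([
    [1.0, 1.0, 0.0],   # rated 5.0
    [1.0, 0.0, 0.0],   # rated 4.0
    [0.0, 0.0, 1.0],   # rated 1.0
])
ratings = np.array([5.0, 4.0, 1.0])

# Assumed weighting: normalize ratings so the weights sum to 1
weights = ratings / ratings.sum()

# Weighted average of movie vectors
user_profile = (movie_vectors.T @ weights).flatten()
print("profile:", user_profile)

# Score two unseen candidates against the profile with a dot product
candidates = np.array([
    [1.0, 1.0, 0.0],   # action + sci-fi
    [0.0, 0.0, 1.0],   # romance
])
scores = candidates @ user_profile
print("scores:", scores)
```

The profile leans heavily toward action and sci-fi, so the first candidate scores far higher than the romance title.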

Comparison and Hybrid Approaches

When to Use Each Approach

Scenario	Best Approach
Cold Start - New User	Content-Based (ask preferences)
Cold Start - New Item	Content-Based (use item features)
Serendipity/Discovery	Collaborative (finds unexpected matches)
Explainability	Content-Based (can show why)
Limited Item Features	Collaborative (learns features)
Privacy Concerns	Content-Based (no user data sharing)
Scalability	Depends on implementation

Hybrid Approach

Many production recommender systems combine both approaches, so that each method covers the other's weaknesses:


class HybridRecommender:
    """
    Hybrid recommender combining collaborative and content-based filtering.
    """
    
    def __init__(self, cf_model, cb_model, cf_weight=0.5):
        """
        Initialize hybrid recommender.
        
        Parameters:
        -----------
        cf_model : CollaborativeFilteringRecommender
            Trained collaborative filtering model
        cb_model : ContentBasedRecommender
            Trained content-based model
        cf_weight : float
            Weight for collaborative filtering (0-1)
            Content-based weight = 1 - cf_weight
        """
        self.cf_model = cf_model
        self.cb_model = cb_model
        self.cf_weight = cf_weight
        self.cb_weight = 1 - cf_weight
        
    def recommend(self, user_idx: int, user_id: str, 
                 user_ratings: Dict[str, float],
                 Y: np.ndarray, n: int = 10) -> List[Tuple[str, float]]:
        """
        Get hybrid recommendations.
        
        Combines scores from both models.
        """
        # Get collaborative filtering recommendations
        cf_recs = self.cf_model.recommend_for_user(user_idx, Y, n=50)
        
        # Get content-based recommendations
        cb_recs = self.cb_model.recommend_for_user(user_id, user_ratings, n=50)
        
        # Combine scores
        combined_scores = {}
        
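        # Add CF scores.
        # Note: CF scores are typically predicted ratings (roughly 1-5) while
        # CB scores are cosine similarities (0-1), so rescale both to a common
        # range before weighting if neither signal should dominate.
        for movie_idx, score in cf_recs:
            # get_movie_title is a helper assumed to be defined elsewhere;
            # it maps a movie index to its title
            movie_title = get_movie_title(movie_idx)
            combined_scores[movie_title] = self.cf_weight * score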
        
        # Add CB scores
        for movie_title, score in cb_recs:
            if movie_title in combined_scores:
                combined_scores[movie_title] += self.cb_weight * score
            else:
                combined_scores[movie_title] = self.cb_weight * score
        
        # Sort by combined score
        recommendations = sorted(
            combined_scores.items(), 
            key=lambda x: x[1], 
            reverse=True
        )[:n]
        
        return recommendations
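
One practical detail when blending: the two models may score on different scales (for example predicted ratings versus cosine similarities). A simple option, sketched below, is to min-max normalize each score set before applying the weights. The function names `min_max_normalize` and `blend` and the example scores are hypothetical, and other rescaling schemes (such as rank-based ones) would work too.

```python
def min_max_normalize(scores):
    """Rescale a {title: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {title: 1.0 for title in scores}  # all scores identical
    return {title: (s - lo) / (hi - lo) for title, s in scores.items()}

def blend(cf_scores, cb_scores, cf_weight=0.5):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    cf_norm = min_max_normalize(cf_scores)
    cb_norm = min_max_normalize(cb_scores)
    titles = set(cf_norm) | set(cb_norm)
    return {
        t: cf_weight * cf_norm.get(t, 0.0) + (1 - cf_weight) * cb_norm.get(t, 0.0)
        for t in titles
    }

# Made-up scores: predicted ratings (CF) and similarities (CB)
cf_scores = {'Avengers': 4.8, 'Titanic': 3.2, 'Shrek': 4.0}
cb_scores = {'Avengers': 0.25, 'Jurassic Park': 0.23, 'Shrek': 0.05}

blended = blend(cf_scores, cb_scores, cf_weight=0.5)
top = sorted(blended.items(), key=lambda pair: pair[1], reverse=True)

for title, score in top:
    print(f"{title}: {score:.2f}")
```

Without the rescaling step, a predicted rating of 4.8 would swamp a similarity of 0.25 regardless of the chosen weights.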

Real-World Production Considerations

1. Scalability

For millions of users and items:

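# Index items for nearest-neighbor search.
# Note: scikit-learn's NearestNeighbors performs exact search; dedicated
# approximate-nearest-neighbor libraries (e.g. FAISS, Annoy) are a common
# choice at very large scale.
from sklearn.neighbors import NearestNeighbors

class ScalableContentBasedRecommender:
    def __init__(self, n_neighbors=50):
        self.nn_model = NearestNeighbors(
            n_neighbors=n_neighbors,
            metric='cosine',
            algorithm='brute'  # tree-based algorithms do not support the cosine metric
        )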
    
    def fit(self, item_features):
        self.nn_model.fit(item_features)
    
    def get_similar_items(self, item_vector, n=10):
        distances, indices = self.nn_model.kneighbors(
            item_vector.reshape(1, -1),
            n_neighbors=n
        )
        return indices[0], 1 - distances[0]  # Convert distance to similarity
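
For intuition about what a brute-force cosine search does, here is a minimal NumPy sketch. It normalizes the rows once so that cosine similarity becomes a dot product, then picks the top k with a partial sort. The function name `top_k_cosine` and the toy matrix are assumptions for illustration; this is still an exact search, not an approximate one.

```python
import numpy as np

def top_k_cosine(item_matrix, query_idx, k=3):
    """Return indices and similarities of the k items most similar to query_idx."""
    # L2-normalize rows so that cosine similarity reduces to a dot product
    norms = np.linalg.norm(item_matrix, axis=1, keepdims=True)
    normalized = item_matrix / np.clip(norms, 1e-12, None)

    sims = normalized @ normalized[query_idx]
    sims[query_idx] = -np.inf            # exclude the query item itself

    k = min(k, len(sims) - 1)
    if k <= 0:
        return np.array([], dtype=int), np.array([])

    # argpartition finds the k largest values without a full sort
    top = np.argpartition(-sims, k - 1)[:k]
    top = top[np.argsort(-sims[top])]    # order only the k winners
    return top, sims[top]

# Toy item features: [action, sci-fi, romance]
items = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.5, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.1],
])

neighbor_idx, neighbor_sims = top_k_cosine(items, query_idx=0, k=2)
print(neighbor_idx, np.round(neighbor_sims, 3))
```

Precomputing the normalized matrix means each query costs one matrix-vector product, which is often fast enough for catalogues of moderate size.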

2. Online Learning

Update models as new ratings arrive:

class OnlineCollaborativeFiltering:
    def update_user_profile(self, user_idx, new_rating, movie_idx):
        """Update user profile with new rating using SGD."""
        # Cache the current vectors so both updates use the pre-update values
        w_old = self.W[user_idx].copy()
        x_old = self.X[movie_idx].copy()
        
        # Prediction error for this single rating
        prediction = np.dot(w_old, x_old) + self.b[0, user_idx]
        error = prediction - new_rating
        
        # Update parameters (squared-error gradient plus L2 regularization)
        self.W[user_idx] -= self.learning_rate * (
            error * x_old + self.lambda_reg * w_old
        )
        self.X[movie_idx] -= self.learning_rate * (
            error * w_old + self.lambda_reg * x_old
        )
        self.b[0, user_idx] -= self.learning_rate * error
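
The following self-contained sketch shows the same kind of update applied repeatedly to one new rating, so the shrinking error is visible. The matrix sizes, learning rate, regularization strength and random initialization are all illustrative assumptions, and `sgd_step` is a hypothetical standalone function rather than part of the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_features = 3, 4, 2

W = rng.normal(scale=0.1, size=(n_users, n_features))   # user vectors
X = rng.normal(scale=0.1, size=(n_movies, n_features))  # movie vectors
b = np.zeros((1, n_users))                               # per-user bias
learning_rate, lambda_reg = 0.05, 0.01

def sgd_step(user_idx, movie_idx, rating):
    """Apply one SGD update for a single (user, movie, rating) observation."""
    # Copy first so both updates use the pre-update vectors
    w_old, x_old = W[user_idx].copy(), X[movie_idx].copy()
    error = w_old @ x_old + b[0, user_idx] - rating

    W[user_idx] -= learning_rate * (error * x_old + lambda_reg * w_old)
    X[movie_idx] -= learning_rate * (error * w_old + lambda_reg * x_old)
    b[0, user_idx] -= learning_rate * error
    return abs(error)

# Feed the same new rating several times and watch the error fall
errors = [sgd_step(user_idx=0, movie_idx=2, rating=4.5) for _ in range(200)]
print(f"first error: {errors[0]:.3f}, last error: {errors[-1]:.3f}")
```

In a real system each rating would usually be seen once or a few times, mixed with other users' ratings, so a smaller learning rate is often preferred to avoid overfitting to the most recent event.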

3. A/B Testing

Always test recommendations in production:

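import hashlib

class ABTestingRecommender:
    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = traffic_split
        
    def recommend(self, user_id, **kwargs):
        # Deterministically assign the user to variant A or B.
        # A stable hash is used because Python's built-in hash() is salted
        # per process for strings, so assignments would change between runs.
        digest = hashlib.md5(str(user_id).encode('utf-8')).hexdigest()
        if int(digest, 16) % 100 < self.traffic_split * 100:
            model = self.model_a
            variant = 'A'
        else:
            model = self.model_b
            variant = 'B'
        
        recommendations = model.recommend(user_id, **kwargs)
        
        # Log for analysis (log_recommendations is assumed to be implemented
        # elsewhere, e.g. writing to an event store)
        self.log_recommendations(user_id, variant, recommendations)
        
        return recommendations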
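
Before launching an experiment it is worth checking that the assignment is both stable and close to the intended split. The sketch below does this for a hash-based bucketing function. The name `assign_variant`, the choice of SHA-256 and the synthetic user IDs are assumptions for illustration.

```python
import hashlib
from collections import Counter

def assign_variant(user_id, traffic_split=0.5):
    """Map a user ID to 'A' or 'B' using a stable hash bucket (0-99)."""
    digest = hashlib.sha256(str(user_id).encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    return 'A' if bucket < traffic_split * 100 else 'B'

# Stability: the same user always lands in the same variant
first = assign_variant('user_001')
second = assign_variant('user_001')
print("stable:", first == second)

# Balance: count assignments over synthetic user IDs
counts = Counter(assign_variant(f'user_{i:05d}') for i in range(10_000))
print(dict(counts))
```

Because the bucket depends only on the user ID, a returning user keeps seeing the same variant, which keeps the comparison between the two models clean.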

4. Diversity

Avoid filter bubbles by adding diversity:

def diverse_recommendations(recommendations, item_vectors, n=10, diversity_weight=0.3):
    """
    Re-rank recommendations to increase diversity.
    
    Uses Maximal Marginal Relevance (MMR).
    
    recommendations : list of (title, score), sorted by score (highest first)
    item_vectors : dict mapping each title to its 1-D feature vector
                   (assumed to be available, e.g. dense TF-IDF rows)
    """
    selected = []
    candidates = list(recommendations)
    
    while len(selected) < n and candidates:
        if not selected:
            # Add highest scoring item first
            selected.append(candidates.pop(0))
        else:
            # Balance relevance and diversity
            mmr_scores = []
            for cand_title, cand_score in candidates:
                # Relevance
                relevance = cand_score
                
                # Diversity (1 - similarity to the closest already-selected item)
                cand_vector = np.asarray(item_vectors[cand_title]).reshape(1, -1)
                diversity = min(
                    1 - cosine_similarity(
                        cand_vector,
                        np.asarray(item_vectors[sel_title]).reshape(1, -1)
                    )[0, 0]
                    for sel_title, _ in selected
                )
                
                # MMR score
                mmr = (1 - diversity_weight) * relevance + diversity_weight * diversity
                mmr_scores.append(mmr)
            
            # Select item with highest MMR
            best_idx = int(np.argmax(mmr_scores))
            selected.append(candidates.pop(best_idx))
    
    return selected

Conclusion

Key Takeaways

Collaborative Filtering:
- ✓ Discovers hidden patterns automatically
- ✓ Excellent for personalization
- ✓ Improves with more data
- ✗ Cold start problem
- ✗ Requires sufficient user-item interactions

Content-Based Filtering:
- ✓ No cold start for items
- ✓ Explainable recommendations
- ✓ Works with limited data
- ✗ Limited serendipity
- ✗ Requires feature engineering

Best Practice: Use a hybrid approach that combines the strengths of both methods!

Practice Exercises

Modify the collaborative filtering model:
- Try different learning rates and regularization parameters
- Experiment with different numbers of features (5, 10, 20, 50)
- Add early stopping based on a validation set
Enhance the content-based model:
- Add numeric features (year, rating, runtime)
- Try different similarity metrics (Euclidean, Pearson)
- Implement user feedback loops
Build a hybrid system:
- Implement the hybrid recommender from scratch
- Test different weighting strategies
- Add diversity promotion
Evaluate your models:
- Implement RMSE, MAE, and precision@k metrics
- Use cross-validation
- Compare model performance
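
As a starting point for the evaluation exercise, here is a minimal plain-Python sketch of the three metrics. The function names and the sample values are made up for illustration, and this `precision_at_k` divides by k even when fewer than k items are recommended, which is one of several conventions in use.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user found relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Made-up held-out ratings and model predictions
y_true = [5.0, 3.0, 4.0, 1.0]
y_pred = [4.5, 3.5, 4.0, 2.0]

# Made-up ranked list and the set of items the user actually liked
recommended = ['Avengers', 'Titanic', 'Jurassic Park', 'Shrek']
relevant = {'Avengers', 'Jurassic Park'}

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"MAE:  {mae(y_true, y_pred):.3f}")
print(f"P@3:  {precision_at_k(recommended, relevant, k=3):.3f}")
```

RMSE and MAE measure rating accuracy, while precision@k measures ranking quality, so a model can look good on one and mediocre on the other.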

Final Thoughts

Recommender systems are at the heart of modern digital experiences. Whether you’re building the next Netflix, Spotify, or Amazon, understanding both collaborative and content-based filtering is essential.

The techniques covered here form the foundation, but modern production systems often use:
- Deep learning (neural collaborative filtering)
- Context-aware recommendations (time, location, device)
- Multi-armed bandits (exploration vs exploitation)
- Graph-based methods (social networks)

Keep learning, keep experimenting, and most importantly - keep building! 🚀
