Introduction

Imagine trying to bake a cake without knowing how much flour, sugar, or eggs to use. You might end up with a disaster! Similarly, in marketing, blindly allocating budgets across different channels is a recipe for wasted resources. That’s where Marketing Mix Modeling (MMM) comes in.

MMM is a powerful analytical technique that helps marketers understand how each marketing activity contributes to sales or conversions. By quantifying the impact of TV ads, online campaigns, and other initiatives, MMM empowers marketers to optimize their spending and maximize ROI.

In this blog post, we’ll explore MMM using Python. We’ll walk through three different approaches to building these models:

Ordinary Least Squares (OLS) Regression: A classic statistical technique for understanding linear relationships.
Bayesian Modeling: A more flexible approach that allows us to incorporate prior knowledge and quantify uncertainty.
Machine Learning: For capturing complex, non-linear patterns in marketing data.

We’ll use popular Python libraries like statsmodels, PyMC, and scikit-learn to build our models, and we’ll emphasize visualizations to make the insights clear and actionable.

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Generate 50 rows of synthetic data. Each column is drawn independently
# at random, so Sales has no built-in relationship to any channel.
# No seed is set, so the exact values differ from run to run.
num_rows = 50
data = {
    'Date': pd.to_datetime(['2023-01-01'] * num_rows) + pd.to_timedelta(np.arange(num_rows), unit='D'),
    'TV': np.random.randint(80, 200, size=num_rows),
    'Radio': np.random.randint(40, 100, size=num_rows),
    'Google': np.random.randint(10, 50, size=num_rows),
    'TikTok': np.random.randint(10, 50, size=num_rows),
    'Facebook': np.random.randint(10, 50, size=num_rows),
    'Sales': np.random.randint(200, 400, size=num_rows)
}

df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Display the first 5 rows
print(df.head().to_markdown(index=True, numalign="left", stralign="left"))

# Print the column names and their data types
print(df.info())

| Date                | TV   | Radio   | Google   | TikTok   | Facebook   | Sales   |
|:--------------------|:-----|:--------|:---------|:---------|:-----------|:--------|
| 2023-01-01 00:00:00 | 90   | 70      | 40       | 25       | 13         | 219     |
| 2023-01-02 00:00:00 | 141  | 99      | 27       | 29       | 14         | 254     |
| 2023-01-03 00:00:00 | 126  | 66      | 17       | 33       | 30         | 370     |
| 2023-01-04 00:00:00 | 96   | 80      | 30       | 44       | 23         | 338     |
| 2023-01-05 00:00:00 | 170  | 53      | 22       | 40       | 20         | 363     |
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 50 entries, 2023-01-01 to 2023-02-19
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   TV        50 non-null     int64
 1   Radio     50 non-null     int64
 2   Google    50 non-null     int64
 3   TikTok    50 non-null     int64
 4   Facebook  50 non-null     int64
 5   Sales     50 non-null     int64
dtypes: int64(6)
memory usage: 2.7 KB
None

# Basic stats for the data
print(df.describe())

               TV      Radio     Google    TikTok   Facebook       Sales
count   50.000000  50.000000  50.000000  50.00000  50.000000   50.000000
mean   139.960000  67.480000  28.120000  31.48000  30.240000  296.480000
std     36.882029  17.054822  11.007678  11.89107  12.251714   56.894546
min     81.000000  41.000000  11.000000  10.00000  11.000000  200.000000
25%    108.500000  52.250000  18.500000  22.50000  18.250000  247.250000
50%    138.000000  66.500000  26.500000  31.00000  30.000000  292.500000
75%    170.000000  77.500000  38.750000  41.00000  43.000000  350.000000
max    198.000000  99.000000  48.000000  49.00000  49.000000  393.000000

# EDA Visualizations
sns.pairplot(df, x_vars=['TV', 'Radio', 'Google', 'TikTok', 'Facebook'], y_vars='Sales', kind='reg')
plt.show()

Pair Plots: A First Glance

The pair plots provide a visual snapshot of how each marketing channel relates to sales. We can see the individual data points and get a sense of the overall trend.

Linear Relationships: In this sample, TV and Radio seem to have a more linear relationship with Sales: as spending increases, sales tend to increase in a fairly consistent way. On real data, a pattern like this would suggest that a simple linear model is a good starting point for these channels.
Less Clear-Cut Relationships: The digital channels (Google, TikTok, and Facebook) show more scattered patterns, making it harder to discern a clear trend. This could indicate non-linear relationships, interactions between channels, or simply noise. On real data, we might need more sophisticated models to capture these complexities.

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

Heatmap: Quantifying the Relationships

The heatmap takes our understanding a step further by quantifying the correlations between each pair of variables.

Strongest Correlation: In this sample, TV advertising shows the strongest positive correlation with sales. A correlation alone does not establish that TV ads drive sales, but it flags TV as the first channel to examine in the models that follow.
Other Correlations: Radio also shows a positive correlation with sales, though weaker than TV. The digital channels have weaker correlations still, and Facebook even shows a slight negative relationship.

Note

A Note of Caution: Spurious Correlations

It’s important to remember that correlation doesn’t necessarily equal causation. The negative correlation with Facebook is best read as a “spurious correlation”: it reflects random chance or factors not captured in our dataset rather than a true negative impact of Facebook advertising. In this example we can be sure of that, because every column was generated independently at random, so all of the correlations in the heatmap are chance patterns. With only 50 rows, such misleading correlations are easy to observe.
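To see how easily chance correlations arise at this sample size, here is a small simulation that needs only NumPy. It repeatedly draws two independent random columns of 50 rows, with the same ranges as the digital-spend and sales columns above, and records their sample correlation. The seed, the number of trials, and the 0.2 threshold are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
num_rows = 50                   # same sample size as the dataset above
num_trials = 2000               # arbitrary number of repetitions

correlations = np.empty(num_trials)
for i in range(num_trials):
    # Two columns drawn independently: the true correlation is zero
    spend = rng.integers(10, 50, size=num_rows)
    sales = rng.integers(200, 400, size=num_rows)
    correlations[i] = np.corrcoef(spend, sales)[0, 1]

# Share of trials where |r| exceeds 0.2 purely by chance
share_large = np.mean(np.abs(correlations) > 0.2)
print(f"mean r: {correlations.mean():.3f}")
print(f"share of trials with |r| > 0.2: {share_large:.2%}")
```

Even though the true correlation is zero in every trial, a noticeable share of the simulated samples should show a correlation beyond ±0.2, which is why a heatmap built on 50 rows deserves cautious reading.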

Model 1: Ordinary Least Squares (OLS) Regression

OLS regression is a workhorse of statistical modeling. It assumes a linear relationship between the predictor variables (marketing spends) and the outcome variable (sales). We’ll fit a model of the form:

Sales = b0 + b1 * TV + b2 * Radio + b3 * Google + b4 * TikTok + b5 * Facebook + error

where the b coefficients represent the impact of each channel on sales.
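As a sketch of what the fitting step does under the hood, the coefficients can be estimated directly with NumPy's least-squares solver. The data below is synthetic and the "true" coefficients are made up for illustration; only two channels are used to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 200

# Hypothetical spends and made-up "true" coefficients for illustration
tv = rng.uniform(80, 200, size=n)
radio = rng.uniform(40, 100, size=n)
true_b0, true_b_tv, true_b_radio = 100.0, 0.5, 0.2
sales = true_b0 + true_b_tv * tv + true_b_radio * radio + rng.normal(0, 5, size=n)

# Design matrix: a column of ones for the intercept b0, then one column per channel
X = np.column_stack([np.ones(n), tv, radio])

# Least-squares solution: the coefficients that minimize the sum of squared errors
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b_tv, b_radio = coefs
print(f"b0={b0:.2f}, b_tv={b_tv:.3f}, b_radio={b_radio:.3f}")
```

Because this synthetic data really does follow a linear rule, the estimates should land close to the made-up values of 0.5 and 0.2. That is the situation in which OLS coefficients can be read as channel effects.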

# Build the OLS model
model_ols = sm.ols(formula='Sales ~ TV + Radio + Google + TikTok + Facebook', data=df).fit()

# Print the model summary
print(model_ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.138
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     1.414
Date:                Sat, 26 Oct 2024   Prob (F-statistic):              0.238
Time:                        09:52:23   Log-Likelihood:                -268.78
No. Observations:                  50   AIC:                             549.6
Df Residuals:                      44   BIC:                             561.0
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    176.7685     66.743      2.648      0.011      42.256     311.281
TV             0.4782      0.222      2.153      0.037       0.031       0.926
Radio          0.1373      0.501      0.274      0.785      -0.872       1.147
Google         0.0278      0.772      0.036      0.971      -1.529       1.585
TikTok         0.8526      0.674      1.265      0.212      -0.505       2.211
Facebook       0.5259      0.701      0.750      0.457      -0.887       1.939
==============================================================================
Omnibus:                        2.774   Durbin-Watson:                   1.777
Prob(Omnibus):                  0.250   Jarque-Bera (JB):                2.234
Skew:                          -0.382   Prob(JB):                        0.327
Kurtosis:                       2.300   Cond. No.                     1.41e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.41e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Note

The OLS model summary provides us with key information:

Coefficients: These tell us the estimated change in sales for a one-unit increase in spending on each channel, holding the other channels constant. For example, the TV coefficient of 0.4782 means that one additional unit of TV spend is associated with about 0.48 more units of sales.
R-squared: This measures the proportion of variance in sales explained by the model. Here it is 0.138, so the model explains roughly 14% of the variance. A higher R-squared indicates a better fit.
P-values: These help us assess the statistical significance of each coefficient. Low p-values (typically below 0.05) suggest that the coefficient is unlikely to be zero; here only TV (p = 0.037) falls below that threshold.
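The fit statistics in the summary are linked by simple formulas. As a rough check, the sketch below recomputes the adjusted R-squared and the F-statistic from the rounded R-squared of 0.138, the 50 observations, and the 5 predictors reported above. Because the input is rounded, the results can only be expected to match the summary approximately.

```python
r_squared = 0.138   # R-squared from the summary (rounded)
n_obs = 50          # "No. Observations"
n_pred = 5          # "Df Model"

df_resid = n_obs - n_pred - 1  # 44, matching "Df Residuals"

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat = (r_squared / n_pred) / ((1 - r_squared) / df_resid)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_stat:.2f}")
```

The values come out close to the 0.041 and 1.414 shown in the summary, which illustrates how much of an already small R-squared disappears once the five predictors are accounted for.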

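The OLS model assigns TV a statistically significant positive coefficient (p = 0.037), while the estimates for the other channels are not distinguishable from zero. The overall fit is weak, however: the R-squared of 0.138 means the model explains only a small portion of the variability in sales, and the F-statistic’s p-value of 0.238 means the model as a whole is not significant. With five predictors each tested at the 0.05 level, a single nominally significant coefficient can easily arise by chance. The large condition number flags possible multicollinearity, which warrants further investigation.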
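One common way to follow up on a multicollinearity warning is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this with NumPy on synthetic data; the helper name `vif` and the extra `tv_online` channel are illustrative assumptions, not part of the dataset above. Values near 1 indicate little collinearity, and a common rule of thumb treats values above roughly 5 to 10 as a concern.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    n, k = X.shape
    factors = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r2 = 1 - resid.var() / target.var()
        factors[j] = 1.0 / (1.0 - r2)
    return factors

rng = np.random.default_rng(2)  # arbitrary seed
n = 50
tv = rng.uniform(80, 200, size=n)
radio = rng.uniform(40, 100, size=n)
# A made-up channel that closely tracks TV, to show what collinearity looks like
tv_online = 0.5 * tv + rng.normal(0, 2, size=n)

independent = vif(np.column_stack([tv, radio]))
collinear = vif(np.column_stack([tv, radio, tv_online]))
print("independent channels:", independent.round(2))
print("with a collinear channel:", collinear.round(2))
```

The condition number in the summary is also sensitive to the scale of the predictors, so it can be large even when the channels are nearly uncorrelated. VIFs are not affected by scale in that way, which makes them a useful second opinion.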

Further visualizations will help us understand the model better.

# Visualizations
# 1. Regression plot (example with TV)
sns.regplot(x='TV', y='Sales', data=df)
plt.show()

The regression plot shows the fitted line and the data points for a specific channel (TV in this case).

# 2. Residual plot
plt.scatter(model_ols.fittedvalues, model_ols.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

The residual plot helps us check the assumptions of the OLS model. Ideally, the residuals should be randomly scattered around zero.
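A visual check can be complemented with a numeric one. The Durbin-Watson statistic reported in the summary (1.777 above) measures first-order autocorrelation in the residuals, and values near 2 suggest little autocorrelation. Here is a minimal sketch of the calculation, using simulated residuals rather than the model's own; the AR(1) coefficient of 0.8 is chosen only to make the contrast visible.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)  # arbitrary seed

# Independent residuals: the statistic should land near 2
independent = rng.normal(0, 1, size=500)

# Positively autocorrelated residuals (AR(1) with coefficient 0.8)
correlated = np.empty(500)
correlated[0] = rng.normal()
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent residuals:    {dw_independent:.2f}")
print(f"autocorrelated residuals: {dw_correlated:.2f}")
```

Applied to the fitted model, `durbin_watson(model_ols.resid)` should reproduce the value in the summary table. For daily marketing data, where one day's sales often resemble the previous day's, this check matters more than it does for the random data used here.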

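# 3. Actual vs. Predicted
plt.scatter(model_ols.fittedvalues, df['Sales'])
# Diagonal reference line: points on it are predicted perfectly
lims = [df['Sales'].min(), df['Sales'].max()]
plt.plot(lims, lims, linestyle='--', color='gray')
plt.xlabel('Predicted Sales')
plt.ylabel('Actual Sales')
plt.title('Actual vs. Predicted Sales')
plt.show()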

The actual vs. predicted plot shows how well the model predicts sales. A good model will have points clustered closely around the diagonal line.

Model 2: Bayesian Marketing Mix Modeling

While OLS regression provides a good starting point, Bayesian methods offer a more flexible and nuanced approach. They allow us to:

Incorporate prior knowledge: We can include our existing beliefs about the effectiveness of different marketing channels.
Quantify uncertainty: Instead of point estimates, we get probability distributions for the model parameters, giving us a better understanding of the range of possible values.

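We’ll use PyMC, a general-purpose probabilistic programming library for Bayesian modeling. (The separate pymc-marketing package builds MMM-specific tools on top of it; here we write the model by hand.)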

import pymc as pm

# Prepare the data for PyMC
X = df[['TV', 'Radio', 'Google', 'TikTok', 'Facebook']]
y = df['Sales']

# Define the Bayesian model
with pm.Model() as model_bayesian:
    # Priors for the coefficients (weakly informative priors)
    intercept = pm.Normal('intercept', mu=0, sigma=100)
    beta_tv = pm.HalfNormal('beta_tv', sigma=10)  # Assuming a non-negative effect of TV
    beta_radio = pm.HalfNormal('beta_radio', sigma=10)
    beta_google = pm.Normal('beta_google', mu=0, sigma=10)
    beta_tiktok = pm.Normal('beta_tiktok', mu=0, sigma=10)
    beta_facebook = pm.Normal('beta_facebook', mu=0, sigma=10)

    # Linear model
    mu = intercept + beta_tv*X['TV'] + beta_radio*X['Radio'] + \
         beta_google*X['Google'] + beta_tiktok*X['TikTok'] + beta_facebook*X['Facebook'] 

    # Likelihood (assuming sales are normally distributed)
    sigma = pm.HalfNormal('sigma', sigma=100)
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=y)

    # Sample from the posterior distribution
    trace = pm.sample(2000, tune=1000)

# Analyze the results (print explicitly so the table also shows outside a notebook)
print(pm.summary(trace))

# Visualizations
# 1. Trace plots
pm.plot_trace(trace) 
plt.show()

# 2. Posterior distributions
pm.plot_posterior(trace) 
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [intercept, beta_tv, beta_radio, beta_google, beta_tiktok, beta_facebook, sigma]

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 8 seconds.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.

Let’s dive deeper into how this Bayesian model works. It’s like being a detective with some initial hunches (priors), gathering evidence (data), and updating your beliefs (posteriors) to crack the case of how marketing drives sales.

1. Setting the Stage: Priors

Think of priors as our starting assumptions about the influence of each marketing channel. We’ve used “weakly informative priors” here. It’s like saying, “We don’t have strong opinions yet, but we generally expect marketing to have a positive effect, though we’re open to being surprised.” For example, we used a HalfNormal distribution for TV and Radio, which restricts their coefficients to be non-negative (more spending should not reduce sales). The digital channels get Normal priors centered at zero, which allow effects of either sign.
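To make the two prior families concrete, the sketch below draws samples with NumPy rather than PyMC. A half-normal draw is the absolute value of a zero-centered normal draw, so it is never negative, while a normal draw can take either sign. The scale of 10 matches the `sigma=10` used in the model; the seed and the number of draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
num_draws = 10_000
sigma = 10  # same scale as the coefficient priors in the model above

# Normal(0, sigma): symmetric around zero, allows negative coefficients
normal_draws = rng.normal(0, sigma, size=num_draws)

# HalfNormal(sigma): absolute value of a Normal(0, sigma) draw, never negative
half_normal_draws = np.abs(rng.normal(0, sigma, size=num_draws))

share_negative = np.mean(normal_draws < 0)
print(f"Normal:     min={normal_draws.min():.1f}, share negative={share_negative:.2f}")
print(f"HalfNormal: min={half_normal_draws.min():.1f}, mean={half_normal_draws.mean():.2f}")
```

The OLS coefficients earlier were all below 1, so a prior with a scale of 10 spreads its mass over a far wider range than the data suggests. That is what makes these priors weak: they rule out little beyond the sign.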

2. The Plot Thickens: Likelihood

The likelihood function is our detective’s magnifying glass – it helps us examine how well our model explains the observed sales data. We assume our sales data follows a normal distribution, with the mean influenced by our marketing spends.

3. The Investigation Unfolds: MCMC

Now, the exciting part! We use a technique called Markov Chain Monte Carlo (MCMC) – imagine our detective exploring different scenarios and gathering clues. MCMC helps us explore the vast space of possible parameter values and figure out which combinations best fit our data and priors. pm.sample() is our detective’s trusty sidekick, doing the legwork of this exploration.
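PyMC uses the NUTS sampler, which is considerably more sophisticated, but the core idea of MCMC can be sketched with a much simpler algorithm: random-walk Metropolis. The toy example below estimates the mean of some synthetic data under a Normal(0, 10) prior with a known noise scale. All of the numbers are illustrative assumptions and are unrelated to the marketing model above.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed

# Synthetic observations with a made-up true mean of 3.0 and known noise sd of 1.0
data = rng.normal(3.0, 1.0, size=50)

def log_posterior(mu):
    """Unnormalized log posterior: Normal(0, 10) prior plus Normal(mu, 1) likelihood."""
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_likelihood = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_likelihood

num_steps = 5000
samples = np.empty(num_steps)
current = 0.0  # arbitrary starting point
current_lp = log_posterior(current)

for i in range(num_steps):
    proposal = current + rng.normal(0, 0.3)  # random-walk proposal
    proposal_lp = log_posterior(proposal)
    # Accept with probability min(1, posterior ratio); otherwise stay put
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        current, current_lp = proposal, proposal_lp
    samples[i] = current

posterior_draws = samples[1000:]  # discard early draws as warm-up
print(f"posterior mean: {posterior_draws.mean():.2f}")
print(f"sample mean of data: {data.mean():.2f}")
```

The discarded warm-up draws play the same role as the `tune=1000` argument in `pm.sample()`: early samples depend on the starting point and are not representative of the posterior.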

4. Cracking the Case: Posteriors

After a thorough investigation (MCMC sampling), we get the posterior distributions: our updated beliefs about the impact of each marketing channel. pm.summary() neatly summarizes these distributions, giving us the average estimated effect, the uncertainty around it, and credible intervals. A credible interval resembles a confidence interval, but its reading is more direct: given the model and the data, the parameter lies inside the interval with the stated probability.

5. Visualizing the Evidence

Finally, we visualize our findings!

Trace Plots: Following the Detective’s Trail

Imagine our detective meticulously documenting every step of their investigation in a notebook. That’s essentially what a trace plot is – a visual record of the MCMC algorithm’s journey as it explores the parameter space.

What to look for:
- Convergence: We want to see the trace plot resemble a “hairy caterpillar” – a dense band of lines meandering around a central value. This indicates that the algorithm has “converged,” meaning it has settled into a stable region of the parameter space and is reliably sampling from the posterior distribution.
- No trends or patterns: If we see clear trends (upward, downward) or repeating patterns, it suggests the algorithm might be stuck and not exploring the full posterior.
- Good mixing: The lines should jump around frequently, indicating the algorithm is efficiently exploring different possibilities. If the lines get stuck in one area for a long time, it suggests poor mixing.
Why it matters:
- Reliability: Convergence assures us that the samples we’re getting are representative of the true posterior distribution. Without convergence, our inferences about the model parameters might be misleading.
- Diagnosing problems: Trace plots can help us identify issues with the model or the MCMC settings. For example, if we see poor mixing, we might need to adjust the algorithm’s parameters or re-evaluate our model.
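Visual inspection is usually paired with a numeric diagnostic. A common one is the Gelman-Rubin statistic (R-hat), which compares the variation between chains with the variation within chains; values close to 1 suggest that the chains agree. The sketch below implements the basic version with NumPy on simulated chains. The summary table reports a more refined variant in its `r_hat` column, so treat this as an illustration of the idea only.

```python
import numpy as np

def r_hat(chains):
    """Basic Gelman-Rubin statistic for an array of shape (num_chains, num_draws)."""
    chains = np.asarray(chains, dtype=float)
    num_chains, num_draws = chains.shape
    chain_means = chains.mean(axis=1)
    # Between-chain variance (scaled by the number of draws)
    between = num_draws * chain_means.var(ddof=1)
    # Average within-chain variance
    within = chains.var(axis=1, ddof=1).mean()
    # Pooled estimate of the posterior variance
    pooled = (num_draws - 1) / num_draws * within + between / num_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(6)  # arbitrary seed

# Four simulated chains sampling the same distribution: R-hat should be near 1
good_chains = rng.normal(0.5, 0.2, size=(4, 2000))

# Same chains, but one is shifted as if it were stuck in a different region
bad_chains = good_chains.copy()
bad_chains[0] += 1.0

rhat_good = r_hat(good_chains)
rhat_bad = r_hat(bad_chains)
print(f"well-mixed chains: {rhat_good:.3f}")
print(f"one stuck chain:   {rhat_bad:.3f}")
```

A stuck chain is exactly what the "hairy caterpillar" check is meant to catch: in a trace plot it would appear as one line running apart from the other three.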

Posterior Distributions: Visualizing the Range of Possibilities

Posterior distributions are the culmination of our Bayesian investigation. They show us the range of plausible values for each parameter after considering both our prior knowledge and the evidence from the data.

What to look for:
- Shape: The shape of the distribution tells us about the uncertainty and the most likely values. A narrow, peaked distribution indicates high certainty, while a wide, spread-out distribution suggests more uncertainty.
- Location: The peak (mode) of the distribution marks the most probable value of the parameter. For roughly symmetric posteriors it sits close to the mean and the median.
- Credible intervals: These intervals (often shown as a bar or shaded region) give us a range of values where the parameter is likely to fall with a certain probability (e.g., 95% credible interval).
Why it matters:
- Quantifying uncertainty: Unlike traditional methods that give us single point estimates, Bayesian models provide a full picture of uncertainty. This is crucial for making informed decisions, as it allows us to account for the range of possible outcomes.
- Comparing parameters: We can compare the posterior distributions of different parameters to see which ones have a stronger influence on the outcome.
- Making predictions: We can use the posterior distributions to generate predictions and assess their uncertainty.
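Given posterior draws for a coefficient, these quantities are straightforward to compute. The sketch below uses simulated draws as a stand-in for a real posterior (the mean of 0.5 and the spread of 0.2 are made up) and reports the posterior mean, a 95% equal-tailed credible interval, and the probability that the coefficient is positive. The summary table in this post reports a highest-density interval instead, which can differ slightly from the equal-tailed interval shown here.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# Stand-in for posterior draws of one coefficient (made-up mean and spread)
draws = rng.normal(0.5, 0.2, size=8000)

posterior_mean = draws.mean()
# 95% equal-tailed credible interval: the 2.5th and 97.5th percentiles
lower, upper = np.percentile(draws, [2.5, 97.5])
# Posterior probability that the effect is positive
prob_positive = np.mean(draws > 0)

print(f"mean: {posterior_mean:.3f}")
print(f"95% credible interval: [{lower:.3f}, {upper:.3f}]")
print(f"P(coefficient > 0): {prob_positive:.3f}")
```

A statement such as "the probability that this channel has a positive effect is X" is often more useful for budget decisions than a p-value, and it falls out of the posterior draws with one line of code.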

In essence:

Trace plots help us assess the process of Bayesian inference (MCMC sampling).
Posterior distributions show us the results of that inference – our updated knowledge about the model parameters.

By carefully examining these visualizations, we can gain a deeper understanding of our Bayesian model, assess its reliability, and draw meaningful conclusions about the relationships between marketing activities and sales.

Model 3: Machine Learning for Marketing Mix Modeling

Machine learning models can capture more complex relationships in the data, such as non-linear effects and interactions between marketing channels, that simpler methods might miss. Think of them as the sophisticated detectives of the data world. Here, we’ll use a Random Forest Regressor, an ensemble method: imagine a team of detectives, each with a slightly different perspective, combining their insights to solve the case.

1. Training and Testing: A Detective’s Prep Work

Before our detective team tackles the case, they need some practice. We split our data into two sets:

Training set: This is where the detectives hone their skills. They study the relationships between marketing spends and sales, learning the patterns and nuances.
Testing set: This is the real test. The detectives apply their learned knowledge to a new set of data, allowing us to see how well they generalize to unseen situations.

2. Building the Model: Assembling the Detective Team

We train our Random Forest model – essentially, we assemble our team of detective “trees,” each specializing in a different aspect of the data. They work together, combining their individual predictions to arrive at a final, more accurate prediction.

3. Evaluating Performance: Assessing the Detectives’ Accuracy

How good are our detectives at predicting sales? We use metrics like:

Mean Squared Error (MSE): This measures the average squared difference between the predicted and actual sales. A lower MSE means better accuracy.
R-squared: This tells us the proportion of variance in sales explained by the model. A higher R-squared indicates a better fit.
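Both metrics are simple to compute by hand, which also makes clear why R-squared on a test set can be negative: it compares the model's squared error with the squared error of always predicting the mean of the actual values. A small sketch with made-up numbers:

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def r_squared(y_true, y_pred):
    """1 minus (model squared error / squared error of predicting the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up actual sales and two sets of predictions
actual = np.array([250, 300, 350, 280, 320])
close_preds = np.array([260, 290, 340, 285, 325])
poor_preds = np.array([340, 260, 270, 350, 250])

print(f"close: MSE={mse(actual, close_preds):.1f}, R2={r_squared(actual, close_preds):.3f}")
print(f"poor:  MSE={mse(actual, poor_preds):.1f}, R2={r_squared(actual, poor_preds):.3f}")
```

The second set of predictions misses by more than a constant guess of 300 (the mean) would, so its R-squared falls below zero.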

4. Feature Importance: Identifying the Key Clues

Our detective team can also tell us which clues they leaned on most. Feature importance scores indicate which marketing channels the model relied on most when predicting sales. The scores used here are impurity-based and describe the fitted model rather than causal effects, so they are a guide for prioritizing marketing efforts, not proof of impact.

5. Partial Dependence Plots: Isolating the Effects

Partial dependence plots are like isolating a suspect for questioning. They show how the model’s predicted sales change as one marketing channel (e.g., TV) varies, averaging over the observed values of the other channels. This lets us see the average effect the model attributes to each channel on its own.
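The calculation behind a partial dependence curve can be sketched in a few lines: for each value on a grid, overwrite the chosen feature with that value in every row, predict, and average the predictions. The function `partial_dependence_curve` and the toy `predict` function below are illustrative stand-ins, not scikit-learn's implementation.

```python
import numpy as np

def partial_dependence_curve(predict, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is set to each grid value."""
    averages = []
    for value in grid:
        X_modified = X.copy()
        X_modified[:, feature_idx] = value  # same value for every row
        averages.append(predict(X_modified).mean())
    return np.array(averages)

# Toy model (hypothetical): sales rise with TV but saturate, plus a linear radio effect
def predict(X):
    tv, radio = X[:, 0], X[:, 1]
    return 200 + 100 * (1 - np.exp(-tv / 100)) + 0.3 * radio

rng = np.random.default_rng(8)  # arbitrary seed
X = np.column_stack([
    rng.uniform(80, 200, size=50),   # TV spend
    rng.uniform(40, 100, size=50),   # Radio spend
])

tv_grid = np.linspace(80, 200, 5)
curve = partial_dependence_curve(predict, X, feature_idx=0, grid=tv_grid)
for tv_value, avg_sales in zip(tv_grid, curve):
    print(f"TV={tv_value:.0f} -> average predicted sales {avg_sales:.1f}")
```

In this toy model each extra unit of TV spend adds less than the one before, and the curve flattens accordingly. Diminishing returns of this kind are what partial dependence plots can reveal and a single linear coefficient cannot.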

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_rf.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 5809.207090000002
R-squared: -0.3276852362276712

The negative R-squared means the model predicts the test set worse than simply predicting the mean of the test sales would. That is the expected outcome here: the sales figures were generated independently of the spend columns, so there is no pattern to learn, and the forest has fit noise in the training set.

# Visualizations
# 1. Feature importance
importances = model_rf.feature_importances_
features = X.columns
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

from sklearn.inspection import PartialDependenceDisplay

# from_estimator draws the plot itself, so no extra .plot() call is needed.
# Feature index 0 is 'TV'.
PartialDependenceDisplay.from_estimator(model_rf, X_train, features=[0])
plt.show()

Model Comparison and Selection

We’ve now explored three different approaches to marketing mix modeling: OLS regression, Bayesian modeling, and a machine learning model (Random Forest). Each has its own strengths and weaknesses:

OLS Regression: Simple, interpretable, but assumes linearity and might be sensitive to outliers.
Bayesian Modeling: More flexible, incorporates prior knowledge, quantifies uncertainty, but can be computationally more intensive.
Machine Learning: Can capture complex relationships and can achieve high predictive accuracy given enough data, but is less interpretable and prone to overfitting small datasets.

To choose the best model, consider the following:

Data complexity: For simple, linear relationships, OLS might suffice. For more complex patterns, consider Bayesian or machine learning models.
Interpretability: If understanding the drivers of sales is crucial, OLS and Bayesian models are generally more interpretable.
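Predictive accuracy: If the primary goal is accurate prediction, machine learning models often excel, provided there is enough data. On our 50-row example, the Random Forest’s test R-squared was negative.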
Business context: The choice should align with the specific business needs and the level of sophistication required.
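When predictive accuracy is one of the criteria, a single 80/20 split of 50 rows leaves only 10 test points, so the comparison can be noisy. K-fold cross-validation gives a steadier picture. The sketch below compares a linear model and a random forest with scikit-learn on synthetic data in which sales depend linearly on spend; the data-generating coefficients are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(9)  # arbitrary seed
n = 200

# Synthetic spends and a made-up linear relationship with noise
X = np.column_stack([
    rng.uniform(80, 200, size=n),   # TV
    rng.uniform(40, 100, size=n),   # Radio
])
y = 100 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 10, size=n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # R-squared on each held-out fold, then averaged
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean R-squared = {scores[name]:.3f}")
```

On data that is linear by construction, the simpler model would typically be expected to hold its own against the forest. The same loop can be pointed at real data to see whether the extra flexibility of a machine learning model pays for itself.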

In our example, let’s say interpretability is key. We might favor the Bayesian model because it provides probability distributions for the coefficients, giving us a better understanding of the uncertainty associated with the estimates. However, if we had a large dataset with complex interactions, the Random Forest might be a strong contender.

Conclusion

Marketing mix modeling is a vital tool for data-driven marketers. By quantifying the impact of different marketing activities, MMM enables informed decision-making, budget optimization, and ROI maximization.

In this post, we journeyed through three different MMM approaches in Python, highlighting the importance of data exploration, model selection, and result interpretation. Remember, the best model depends on your specific needs and data characteristics.

By embracing these techniques, marketers can move beyond guesswork and allocate resources strategically, driving business growth and success.