Gradient Descent Explained: The Engine Behind Machine Learning
NLP
Author
Fabian
Published
February 12, 2024
1. What is Gradient Descent?
Gradient Descent is an optimization algorithm used to find the minimum of a function.
Imagine you’re standing on a hill (a function curve).
You want to reach the bottom (the minimum).
At each step, you look at the slope (the gradient) of the hill and move downhill a little.
The learning rate controls how big each step is.
You keep repeating until you’re close enough to the bottom.
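Before looking at the math, here is a minimal sketch of that loop in Python (not tied to any particular model); f_prime here stands for whatever function returns the slope at x:

def gradient_descent(f_prime, x, learning_rate=0.1, iterations=100):
    # Repeat: measure the slope, then take a small step downhill
    for _ in range(iterations):
        x = x - learning_rate * f_prime(x)
    return x

print(gradient_descent(lambda x: 2*x, 10.0))  # for f(x) = x^2, this approaches 0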
2. The Math Idea
Suppose we have a function f(x). The update rule for x is:
x_new = x_old - α · f'(x_old)
where:
f'(x) = derivative (gradient)
α = learning rate (step size)
For multiple variables (like in machine learning models), you just apply this rule for each parameter.
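For example, if the parameters live in a NumPy array, the same rule can be applied to all of them at once. This is only an illustrative sketch; grad is assumed to return the vector of partial derivatives:

import numpy as np

def step(params, grad, learning_rate=0.1):
    # One gradient descent update applied to every parameter at once
    return params - learning_rate * grad(params)

# Example: minimize f(p) = p1^2 + p2^2, whose gradient is 2*p
params = np.array([3.0, -4.0])
for _ in range(50):
    params = step(params, lambda p: 2 * p)
print(params)  # both entries move toward 0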
3. Python Example – Gradient Descent on a Simple Function
Let’s minimize:
f(x) = x^2
Its derivative is:
f'(x) = 2x
import matplotlib.pyplot as plt

# Function and derivative
def f(x):
    return x**2

def f_prime(x):
    return 2*x

# Gradient descent parameters
x = 10              # start point
learning_rate = 0.1
iterations = 20
history = [x]       # to track progress

# Gradient descent loop
for i in range(iterations):
    gradient = f_prime(x)
    x = x - learning_rate * gradient
    history.append(x)
    print(f"Iteration {i+1}: x = {x:.4f}, f(x) = {f(x):.4f}")

# Plot the path
xs = [i for i in history]
ys = [f(val) for val in history]
plt.plot(xs, ys, "o-", label="Gradient Descent Path")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Gradient Descent on f(x) = x^2")
plt.legend()
plt.show()
Iteration 1: x = 8.0000, f(x) = 64.0000
Iteration 2: x = 6.4000, f(x) = 40.9600
Iteration 3: x = 5.1200, f(x) = 26.2144
Iteration 4: x = 4.0960, f(x) = 16.7772
Iteration 5: x = 3.2768, f(x) = 10.7374
Iteration 6: x = 2.6214, f(x) = 6.8719
Iteration 7: x = 2.0972, f(x) = 4.3980
Iteration 8: x = 1.6777, f(x) = 2.8147
Iteration 9: x = 1.3422, f(x) = 1.8014
Iteration 10: x = 1.0737, f(x) = 1.1529
Iteration 11: x = 0.8590, f(x) = 0.7379
Iteration 12: x = 0.6872, f(x) = 0.4722
Iteration 13: x = 0.5498, f(x) = 0.3022
Iteration 14: x = 0.4398, f(x) = 0.1934
Iteration 15: x = 0.3518, f(x) = 0.1238
Iteration 16: x = 0.2815, f(x) = 0.0792
Iteration 17: x = 0.2252, f(x) = 0.0507
Iteration 18: x = 0.1801, f(x) = 0.0325
Iteration 19: x = 0.1441, f(x) = 0.0208
Iteration 20: x = 0.1153, f(x) = 0.0133
4. Output
The x values will keep moving closer to 0 (the minimum of f(x) = x^2).
The plot will show how gradient descent moves step by step downhill.
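To see how the learning rate from section 1 changes the behaviour, here is a small sketch that reruns the same loop (f(x) = x^2, so f'(x) = 2x) with different step sizes; a tiny rate converges slowly, while a rate above 1.0 overshoots and diverges for this function:

def run(learning_rate, x=10.0, iterations=20):
    # Same loop as above, returning the final x for a given step size
    for _ in range(iterations):
        x = x - learning_rate * 2 * x   # f_prime(x) = 2x
    return x

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(f"learning_rate={lr}: final x = {run(lr):.4f}")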
Linear Regression with Gradient Descent
1. Problem Setup
We’ll try to fit a line:
ŷ = m·x + b
Given some data points (x_i, y_i), we want to find the best slope m and intercept b.
We’ll use the Mean Squared Error (MSE) as the loss function:
MSE = (1/n) · Σ (y_i - ŷ_i)^2
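As a quick illustration with made-up numbers, the MSE simply averages the squared differences between targets and predictions:

import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.5, 6.5])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.25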
2. Gradient Derivatives
To minimize the MSE, we compute the gradients:
∂MSE/∂m = (-2/n) · Σ x_i · (y_i - ŷ_i)
∂MSE/∂b = (-2/n) · Σ (y_i - ŷ_i)
Then update:
m ← m - α · ∂MSE/∂m
b ← b - α · ∂MSE/∂b
3. Python Implementation
import numpy as np
import matplotlib.pyplot as plt

# Fake dataset
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)  # perfect line: y = 2x

# Initialize parameters
m = 0.0   # slope
b = 0.0   # intercept
learning_rate = 0.01
iterations = 1000
n = len(X)

# Gradient descent loop
for _ in range(iterations):
    y_pred = m * X + b
    error = y - y_pred

    # Gradients
    dm = (-2/n) * np.sum(X * error)
    db = (-2/n) * np.sum(error)

    # Update parameters
    m -= learning_rate * dm
    b -= learning_rate * db

print(f"Final slope (m): {m:.4f}")
print(f"Final intercept (b): {b:.4f}")

# Plot results
plt.scatter(X, y, label="Data")
plt.plot(X, m*X + b, color="red", label="Fitted Line")
plt.legend()
plt.show()
Final slope (m): 1.9952
Final intercept (b): 0.0174
4. Expected Result
Since the real relationship is y = 2x, gradient descent should find:
m ≈ 2
b ≈ 0
And the red line will match the blue data points.
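As a sanity check (not part of the gradient descent loop itself), you can compare the result with NumPy's closed-form least-squares fit; for this dataset both should land near slope 2 and intercept 0:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(X, y, 1)
print(slope, intercept)  # approximately 2.0 and 0.0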
Gradient Descent for Multiple Features (Multivariable Linear Regression)
1. Problem Setup
For multiple features, the model is:
ŷ = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b
or in vector form:
ŷ = X·w + b
Where:
X = data matrix (m × n, with m samples and n features)
w = weights (slopes for each feature)
b = bias (intercept)
Loss function (Mean Squared Error):
MSE = (1/m) · Σ (y_i - ŷ_i)^2
2. Gradients
∂MSE/∂w = (-2/m) · X^T · (y - ŷ)
∂MSE/∂b = (-2/m) · Σ (y_i - ŷ_i)
3. Python Implementation
import numpy as np

# Fake dataset (2 features)
# y = 3*x1 + 5*x2 + 10
X = np.array([
    [1, 2],
    [2, 1],
    [3, 4],
    [4, 3],
    [5, 5]
], dtype=float)
y = np.array([3*row[0] + 5*row[1] + 10 for row in X], dtype=float)

# Parameters
m, n = X.shape
weights = np.zeros(n)
bias = 0.0
learning_rate = 0.01
iterations = 1000

# Gradient descent loop
for _ in range(iterations):
    y_pred = X.dot(weights) + bias
    error = y - y_pred

    # Gradients
    dw = (-2/m) * X.T.dot(error)
    db = (-2/m) * np.sum(error)

    # Update
    weights -= learning_rate * dw
    bias -= learning_rate * db

print(f"Final weights: {weights}")
print(f"Final bias: {bias:.4f}")

# Test on new data
x_new = np.array([6, 2])
y_new = x_new.dot(weights) + bias
print(f"Prediction for {x_new}: {y_new:.2f}")
Final weights: [3.04840198 5.04775229]
Final bias: 9.6564
Prediction for [6 2]: 38.04
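Again as a sanity check, the same weights and bias can be recovered in closed form with np.linalg.lstsq by appending a column of ones for the intercept; the solution should be close to w = [3, 5] and b = 10:

import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([3*row[0] + 5*row[1] + 10 for row in X], dtype=float)

# Append a bias column of ones, then solve the least-squares problem directly
X_b = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # approximately [3.0, 5.0, 10.0]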