Gradient Descent Explained: The Engine Behind Machine Learning
NLP
Author
Fabian
Published
February 12, 2024
1. What is Gradient Descent?
Gradient Descent is an optimization algorithm used to find the minimum of a function.
Imagine you’re standing on a hill (a function curve).
You want to reach the bottom (the minimum).
At each step, you look at the slope (the gradient) of the hill and move downhill a little.
The learning rate controls how big each step is.
You keep repeating until you’re close enough to the bottom.
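Before looking at the math, here is a minimal sketch of that loop in Python (not tied to any particular model); f_prime here stands for whatever function returns the slope at x:

def gradient_descent(f_prime, x, learning_rate=0.1, iterations=100):
    # Repeat: measure the slope, then take a small step downhill
    for _ in range(iterations):
        x = x - learning_rate * f_prime(x)
    return x

print(gradient_descent(lambda x: 2*x, 10.0))  # for f(x) = x^2, this approaches 0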
2. The Math Idea
Suppose we have a function f(x). The update rule for x is:
x_new = x_old - α · f'(x_old)
where:
f'(x) = derivative (gradient)
α = learning rate (step size)
For multiple variables (like in machine learning models), you just apply this rule for each parameter.
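For example, if the parameters live in a NumPy array, the same rule can be applied to all of them at once. This is only an illustrative sketch; grad is assumed to return the vector of partial derivatives:

import numpy as np

def step(params, grad, learning_rate=0.1):
    # One gradient descent update applied to every parameter at once
    return params - learning_rate * grad(params)

# Example: minimize f(p) = p1^2 + p2^2, whose gradient is 2*p
params = np.array([3.0, -4.0])
for _ in range(50):
    params = step(params, lambda p: 2 * p)
print(params)  # both entries move toward 0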
3. Python Example – Gradient Descent on a Simple Function
Let’s minimize:
f(x) = x^2
Its derivative is:
f'(x) = 2x
import matplotlib.pyplot as plt

# Function and derivative
def f(x):
    return x**2

def f_prime(x):
    return 2*x

# Gradient descent parameters
x = 10              # start point
learning_rate = 0.1
iterations = 20
history = [x]       # to track progress

# Gradient descent loop
for i in range(iterations):
    gradient = f_prime(x)
    x = x - learning_rate * gradient
    history.append(x)
    print(f"Iteration {i+1}: x = {x:.4f}, f(x) = {f(x):.4f}")

# Plot the path
xs = [i for i in history]
ys = [f(val) for val in history]
plt.plot(xs, ys, "o-", label="Gradient Descent Path")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Gradient Descent on f(x) = x^2")
plt.legend()
plt.show()
Iteration 1: x = 8.0000, f(x) = 64.0000
Iteration 2: x = 6.4000, f(x) = 40.9600
Iteration 3: x = 5.1200, f(x) = 26.2144
Iteration 4: x = 4.0960, f(x) = 16.7772
Iteration 5: x = 3.2768, f(x) = 10.7374
Iteration 6: x = 2.6214, f(x) = 6.8719
Iteration 7: x = 2.0972, f(x) = 4.3980
Iteration 8: x = 1.6777, f(x) = 2.8147
Iteration 9: x = 1.3422, f(x) = 1.8014
Iteration 10: x = 1.0737, f(x) = 1.1529
Iteration 11: x = 0.8590, f(x) = 0.7379
Iteration 12: x = 0.6872, f(x) = 0.4722
Iteration 13: x = 0.5498, f(x) = 0.3022
Iteration 14: x = 0.4398, f(x) = 0.1934
Iteration 15: x = 0.3518, f(x) = 0.1238
Iteration 16: x = 0.2815, f(x) = 0.0792
Iteration 17: x = 0.2252, f(x) = 0.0507
Iteration 18: x = 0.1801, f(x) = 0.0325
Iteration 19: x = 0.1441, f(x) = 0.0208
Iteration 20: x = 0.1153, f(x) = 0.0133
4. Output
The x values will keep moving closer to 0 (the minimum of f(x) = x^2).
The plot will show how gradient descent moves step by step downhill.
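To see how the learning rate from section 1 changes the behaviour, here is a small sketch that reruns the same loop (f(x) = x^2, so f'(x) = 2x) with different step sizes; a tiny rate converges slowly, while a rate above 1.0 overshoots and diverges for this function:

def run(learning_rate, x=10.0, iterations=20):
    # Same loop as above, returning the final x for a given step size
    for _ in range(iterations):
        x = x - learning_rate * 2 * x   # f_prime(x) = 2x
    return x

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(f"learning_rate={lr}: final x = {run(lr):.4f}")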
Linear Regression with Gradient Descent
1. Problem Setup
We’ll try to fit a line:
ŷ = m·x + b
Given some data points (x_i, y_i), we want to find the best slope m and intercept b.
We’ll use the Mean Squared Error (MSE) as the loss function:
MSE = (1/n) · Σ (y_i - ŷ_i)^2
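As a quick illustration with made-up numbers, the MSE simply averages the squared differences between targets and predictions:

import numpy as np

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.5, 6.5])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.25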
2. Gradient Derivatives
To minimize the MSE, we compute the gradients:
∂MSE/∂m = (-2/n) · Σ x_i · (y_i - ŷ_i)
∂MSE/∂b = (-2/n) · Σ (y_i - ŷ_i)
Then update:
m ← m - α · ∂MSE/∂m
b ← b - α · ∂MSE/∂b
3. Python Implementation
import numpy as np
import matplotlib.pyplot as plt

# Fake dataset
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)  # perfect line: y = 2x

# Initialize parameters
m = 0.0   # slope
b = 0.0   # intercept
learning_rate = 0.01
iterations = 1000
n = len(X)

# Gradient descent loop
for _ in range(iterations):
    y_pred = m * X + b
    error = y - y_pred

    # Gradients
    dm = (-2/n) * np.sum(X * error)
    db = (-2/n) * np.sum(error)

    # Update parameters
    m -= learning_rate * dm
    b -= learning_rate * db

print(f"Final slope (m): {m:.4f}")
print(f"Final intercept (b): {b:.4f}")

# Plot results
plt.scatter(X, y, label="Data")
plt.plot(X, m*X + b, color="red", label="Fitted Line")
plt.legend()
plt.show()
Final slope (m): 1.9952
Final intercept (b): 0.0174
4. Expected Result
Since the real relationship is y = 2x, gradient descent should find:
m ≈ 2
b ≈ 0
And the red line will match the blue data points.
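As a sanity check (not part of the gradient descent loop itself), you can compare the result with NumPy's closed-form least-squares fit; for this dataset both should land near slope 2 and intercept 0:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(X, y, 1)
print(slope, intercept)  # approximately 2.0 and 0.0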
Gradient Descent for Multiple Features (Multivariable Linear Regression)
1. Problem Setup
For multiple features, the model is:
ŷ = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b
or in vector form:
ŷ = X·w + b
Where:
X = data matrix (m × n, with m samples and n features)
w = weights (slopes for each feature)
b = bias (intercept)
Loss function (Mean Squared Error):
MSE = (1/m) · Σ (y_i - ŷ_i)^2
2. Gradients
∂MSE/∂w = (-2/m) · X^T · (y - ŷ)
∂MSE/∂b = (-2/m) · Σ (y_i - ŷ_i)
3. Python Implementation
import numpy as np

# Fake dataset (2 features)
# y = 3*x1 + 5*x2 + 10
X = np.array([
    [1, 2],
    [2, 1],
    [3, 4],
    [4, 3],
    [5, 5]
], dtype=float)
y = np.array([3*row[0] + 5*row[1] + 10 for row in X], dtype=float)

# Parameters
m, n = X.shape
weights = np.zeros(n)
bias = 0.0
learning_rate = 0.01
iterations = 1000

# Gradient descent loop
for _ in range(iterations):
    y_pred = X.dot(weights) + bias
    error = y - y_pred

    # Gradients
    dw = (-2/m) * X.T.dot(error)
    db = (-2/m) * np.sum(error)

    # Update
    weights -= learning_rate * dw
    bias -= learning_rate * db

print(f"Final weights: {weights}")
print(f"Final bias: {bias:.4f}")

# Test on new data
x_new = np.array([6, 2])
y_new = x_new.dot(weights) + bias
print(f"Prediction for {x_new}: {y_new:.2f}")
Final weights: [3.04840198 5.04775229]
Final bias: 9.6564
Prediction for [6 2]: 38.04
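Again as a sanity check, the same weights and bias can be recovered in closed form with np.linalg.lstsq by appending a column of ones for the intercept; the solution should be close to w = [3, 5] and b = 10:

import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([3*row[0] + 5*row[1] + 10 for row in X], dtype=float)

# Append a bias column of ones, then solve the least-squares problem directly
X_b = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # approximately [3.0, 5.0, 10.0]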