Learning Statistics and Probability with SciPy and Python
Statistics
Author: Fabian
Published: March 21, 2024
This Jupyter Notebook provides a comprehensive guide to the fundamentals of statistics and probability, using SciPy and Python for practical implementation and visualization.
I. Introduction
Why Statistics and Probability Matter:
Statistics and probability are crucial for understanding and interpreting data across various fields:
Science: Analyzing experimental results, drawing conclusions, and making predictions.
Business: Understanding market trends, customer behavior, and financial risk.
Everyday Life: Making informed decisions based on data, understanding risk, and interpreting information.
The Role of SciPy:
SciPy (Scientific Python) is a powerful open-source library offering a wide range of scientific computing tools:
Statistical functions: Calculate descriptive statistics, probability distributions, hypothesis tests, and more.
Optimization algorithms: Find optimal solutions to mathematical problems.
Numerical integration and differentiation: Solve calculus problems numerically.
Signal processing: Analyze and manipulate signals.
Python for Visualization:
Python provides excellent libraries for creating insightful visualizations:
Matplotlib: A versatile library for creating static, interactive, and animated visualizations.
Seaborn: Built on Matplotlib, Seaborn provides a high-level interface for creating statistically informative and visually appealing plots.
II. Descriptive Statistics
Data Types and Structures:
Data can be classified into different types:
Numerical: Quantitative data that can be measured (e.g., height, weight, temperature).
Categorical: Qualitative data that can be categorized (e.g., gender, color, city).
Data can be organized in various structures:
Arrays: A collection of elements of the same data type.
DataFrames: A two-dimensional table-like structure with rows and columns.
import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Create a Pandas DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})
print(data)
print(df)
[1 2 3 4 5]
Name Age City
0 Alice 25 New York
1 Bob 30 London
2 Charlie 28 Paris
Measures of Central Tendency:
Measures of central tendency describe the center of a dataset:
Mean: The average of all values.
Median: The middle value when the data is sorted.
Mode: The most frequent value.
from scipy import stats

# Calculate mean, median, and mode
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode[0]}")  # Every value in data occurs once, so SciPy reports the smallest value
Mean: 3.0
Median: 3.0
Mode: 1
Visualize central tendency with histograms and box plots:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
plt.hist(data)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Box plot
sns.boxplot(data)
plt.title('Box Plot')
plt.show()
Measures of Dispersion:
Measures of dispersion describe the spread or variability of a dataset:
Variance: The average squared deviation from the mean.
Standard deviation: The square root of the variance.
Range: The difference between the maximum and minimum values.
# Calculate variance, standard deviation, and range
variance = np.var(data)
std_dev = np.std(data)
data_range = np.ptp(data)  # ptp stands for "peak-to-peak"
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Range: {data_range}")
Variance: 2.0
Standard Deviation: 1.4142135623730951
Range: 4
Visualize dispersion with box plots and scatter plots:
# Generate some random data for a scatter plot
x = np.random.rand(50)
y = 2 * x + np.random.rand(50)

# Scatter plot
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Correlation and Covariance:
Correlation: Measures the linear relationship between two variables.
Covariance: Measures how two variables vary together.
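Neither quantity needs more than a single NumPy call. As a minimal sketch, reusing the x and y arrays from the scatter-plot cell above:
# Correlation and covariance of the x and y arrays from the scatter-plot example
correlation = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient, between -1 and 1
covariance = np.cov(x, y)[0, 1]        # Sample covariance
print(f"Correlation: {correlation}")
print(f"Covariance: {covariance}")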
III. Probability
Basic Concepts:
Probability is the branch of mathematics that deals with the likelihood of events occurring. Here are some fundamental concepts:
Probability: A number between 0 and 1 that represents the chance of an event happening. A probability of 0 means the event is impossible, while a probability of 1 means the event is certain.
Sample Space: The set of all possible outcomes of an experiment. For example, the sample space of flipping a coin is {Heads, Tails}.
Event: A subset of the sample space. For example, getting Heads when flipping a coin is an event.
import random

# Simulate flipping a coin 10 times
outcomes = [random.choice(['Heads', 'Tails']) for _ in range(10)]
print(outcomes)

# Count the number of heads
num_heads = outcomes.count('Heads')
print(f"Number of heads: {num_heads}")

# Estimate the probability of heads
prob_heads = num_heads / len(outcomes)
print(f"Estimated probability of heads: {prob_heads}")
['Heads', 'Heads', 'Tails', 'Heads', 'Tails', 'Tails', 'Heads', 'Heads', 'Tails', 'Tails']
Number of heads: 5
Estimated probability of heads: 0.5
Probability Distributions:
A probability distribution describes the probability of different outcomes in an experiment. There are two main types:
Discrete Distributions: Deal with discrete random variables, which take on values from a countable set (such as the non-negative integers).
Bernoulli Distribution: Models the probability of success or failure in a single trial (e.g., flipping a coin once).
Binomial Distribution: Models the probability of a certain number of successes in a fixed number of independent trials (e.g., flipping a coin 10 times and getting 5 heads).
Poisson Distribution: Models the probability of a certain number of events occurring in a fixed interval of time or space (e.g., the number of cars passing a certain point on a highway in an hour).
from scipy.stats import bernoulli, binom, poisson

# Bernoulli distribution
p = 0.5  # Probability of success
rv_bernoulli = bernoulli(p)
print(f"Bernoulli PMF: {rv_bernoulli.pmf(1)}")  # Probability of success (1)

# Binomial distribution
n = 10  # Number of trials
rv_binom = binom(n, p)
print(f"Binomial PMF: {rv_binom.pmf(5)}")  # Probability of 5 successes

# Poisson distribution
mu = 3  # Average number of events
rv_poisson = poisson(mu)
print(f"Poisson PMF: {rv_poisson.pmf(2)}")  # Probability of 2 events
# Visualize the binomial and Poisson PMFs as bar charts
x_binom = np.arange(0, n + 1)
plt.bar(x_binom, rv_binom.pmf(x_binom))
plt.title('Binomial PMF')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()

x_poisson = np.arange(0, 10)
plt.bar(x_poisson, rv_poisson.pmf(x_poisson))
plt.title('Poisson PMF')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.show()
Continuous Distributions: Deal with continuous random variables, which can take on any value within a given range.
Normal Distribution: A bell-shaped distribution that is commonly used to model many natural phenomena (e.g., height, weight, IQ scores).
Exponential Distribution: Models the time between events that happen at a constant average rate (e.g., the time between customer arrivals at a store).
Uniform Distribution: All values within a given range are equally likely (e.g., a random number generator that produces numbers between 0 and 1).
from scipy.stats import norm, expon, uniform

# Normal distribution
mu = 0     # Mean
sigma = 1  # Standard deviation
rv_norm = norm(mu, sigma)
print(f"Normal PDF: {rv_norm.pdf(1)}")  # Probability density at x = 1

# Exponential distribution
lambda_ = 1  # Rate parameter
rv_expon = expon(scale=1/lambda_)
print(f"Exponential PDF: {rv_expon.pdf(1)}")

# Uniform distribution
a = 0  # Lower bound (loc)
b = 1  # Width (scale); in SciPy, uniform(a, b) covers [a, a + b], here [0, 1]
rv_uniform = uniform(a, b)
print(f"Uniform PDF: {rv_uniform.pdf(0.5)}")
Normal PDF: 0.24197072451914337
Exponential PDF: 0.36787944117144233
Uniform PDF: 1.0
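The continuous densities can be visualized much like the discrete PMFs above. A minimal sketch plotting all three frozen distributions over a common grid of values:
# Plot the normal, exponential, and uniform PDFs on a common grid
x_grid = np.linspace(-4, 4, 200)
plt.plot(x_grid, rv_norm.pdf(x_grid), label='Normal')
plt.plot(x_grid, rv_expon.pdf(x_grid), label='Exponential')
plt.plot(x_grid, rv_uniform.pdf(x_grid), label='Uniform')
plt.title('Continuous PDFs')
plt.xlabel('x')
plt.ylabel('Density')
plt.legend()
plt.show()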
Conditional Probability and Bayes’ Theorem:
Conditional Probability: The probability of an event happening given that another event has already occurred.
Independence: Two events are independent if the occurrence of one does not affect the probability of the other.
Bayes’ Theorem: A way to update the probability of an event based on new evidence.
# Example: Two events A and B
p_a = 0.3          # Probability of A
p_b = 0.4          # Probability of B
p_b_given_a = 0.5  # Probability of B given A

# Calculate P(A|B) using Bayes' Theorem
p_a_given_b = (p_b_given_a * p_a) / p_b
print(f"P(A|B): {p_a_given_b}")
P(A|B): 0.37499999999999994
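Independence can be checked numerically as well: A and B are independent exactly when P(A and B) = P(A) * P(B). A tiny sketch, using a hypothetical joint probability:
# Hypothetical joint probability for illustration
p_a_and_b = 0.12
print(f"Independent: {np.isclose(p_a_and_b, p_a * p_b)}")  # True, since 0.3 * 0.4 = 0.12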
IV. Inferential Statistics
Inferential statistics allows us to draw conclusions about a population based on a sample of data. It involves using sample data to estimate population parameters and test hypotheses.
Sampling and Estimation:
Sampling: The process of selecting a subset of individuals from a population. Different sampling methods include:
Random sampling: Every member of the population has an equal chance of being selected.
Stratified sampling: The population is divided into subgroups (strata), and random samples are taken from each stratum.
Cluster sampling: The population is divided into clusters, and a random sample of clusters is selected. (A short pandas sketch of these methods follows the estimation definitions below.)
Estimation: The process of using sample data to estimate population parameters.
Point estimate: A single value that estimates a population parameter (e.g., the sample mean as an estimate of the population mean).
Confidence interval: A range of values that is likely to contain the population parameter with a certain level of confidence (e.g., a 95% confidence interval).
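As an illustrative sketch of the sampling methods above (the city labels and income figures are made up for demonstration), pandas supports both simple random and stratified sampling directly:
# Hypothetical population: 1,000 people across three cities
population = pd.DataFrame({
    'city': np.random.choice(['New York', 'London', 'Paris'], size=1000),
    'income': np.random.normal(loc=50000, scale=15000, size=1000)
})

# Simple random sample: every row has an equal chance of selection
random_sample = population.sample(n=100)

# Stratified sample: draw 10% from each city (stratum)
stratified_sample = population.groupby('city', group_keys=False).apply(
    lambda stratum: stratum.sample(frac=0.1))
print(random_sample.shape, stratified_sample.shape)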
from scipy.stats import t

# Sample data: random draws from a normal distribution
data = np.random.normal(loc=50, scale=10, size=100)

# Calculate the sample mean and standard deviation
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # Use ddof=1 for the sample standard deviation

# Calculate the 95% confidence interval
confidence_level = 0.95
alpha = 1 - confidence_level
degrees_of_freedom = len(data) - 1
t_critical = t.ppf(1 - alpha / 2, degrees_of_freedom)  # Get the t-critical value
margin_of_error = t_critical * sample_std / np.sqrt(len(data))
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print(f"Sample Mean: {sample_mean}")
print(f"95% Confidence Interval: {confidence_interval}")
Hypothesis Testing:
Hypothesis testing is a process for evaluating evidence against a claim about a population.
Null hypothesis (H0): The claim being tested, usually stating that there is no effect or difference.
Alternative hypothesis (H1 or Ha): The opposite of the null hypothesis, stating that there is an effect or difference.
Type I error: Rejecting the null hypothesis when it is actually true (false positive).
Type II error: Failing to reject the null hypothesis when it is actually false (false negative).
from scipy.stats import ttest_1samp

# One-sample t-test
population_mean = 55  # Hypothesized population mean
t_statistic, p_value = ttest_1samp(data, population_mean)
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Interpret the results: compare the p-value to a significance level (e.g., 0.05).
# If p-value < significance level, reject the null hypothesis.
Regression Analysis:
Regression analysis is used to model the relationship between a dependent variable and one or more independent variables.
Linear regression: Models a linear relationship between variables.
Assumptions: Linearity, independence, homoscedasticity, and normality of residuals. (A residual-plot sketch for checking these follows the model summary below.)
import statsmodels.formula.api as sm

# Sample data
x = np.linspace(0, 10, 100)
y = 2 * x + np.random.normal(scale=2, size=100)

# Create a DataFrame
df_regression = pd.DataFrame({'x': x, 'y': y})

# Fit a linear regression model
model = sm.ols('y ~ x', data=df_regression).fit()
print(model.summary())

# Visualize the regression line
sns.regplot(x='x', y='y', data=df_regression)
plt.title('Linear Regression')
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.918
Model: OLS Adj. R-squared: 0.917
Method: Least Squares F-statistic: 1101.
Date: Sat, 04 Jan 2025 Prob (F-statistic): 4.21e-55
Time: 20:09:16 Log-Likelihood: -206.62
No. Observations: 100 AIC: 417.2
Df Residuals: 98 BIC: 422.5
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9313 0.383 -2.431 0.017 -1.692 -0.171
x 2.1964 0.066 33.187 0.000 2.065 2.328
==============================================================================
Omnibus: 0.348 Durbin-Watson: 1.824
Prob(Omnibus): 0.840 Jarque-Bera (JB): 0.504
Skew: 0.108 Prob(JB): 0.777
Kurtosis: 2.728 Cond. No. 11.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
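The assumptions listed earlier (especially homoscedasticity and normality of residuals) are easiest to check visually. A minimal sketch of a residuals-versus-fitted plot, using the fitted statsmodels results object:
# Residuals vs. fitted values: look for constant spread around zero
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs. Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()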
V. Advanced Topics (Optional)
This section introduces some advanced topics in statistics and probability that you can explore further based on your interests and goals.
Bayesian Statistics:
Bayesian Inference: An alternative approach to statistical inference that uses Bayes’ Theorem to update prior beliefs about a population parameter based on observed data, resulting in a posterior distribution.
Prior Distribution: Represents prior knowledge or beliefs about the parameter before observing any data.
Posterior Distribution: Represents updated beliefs about the parameter after observing data.
# This requires installing pymc3: !pip install pymc3
# import pymc3 as pm

# # Example: Estimating the probability of heads in a coin flip
# data = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # 1 represents heads, 0 represents tails
# with pm.Model() as model:
#     # Prior distribution for p (probability of heads)
#     p = pm.Beta('p', alpha=1, beta=1)  # Uniform prior
#     # Likelihood function (Bernoulli distribution)
#     likelihood = pm.Bernoulli('likelihood', p=p, observed=data)
#     # Inference
#     trace = pm.sample(2000, tune=1000)
# # Analyze posterior distribution
# pm.traceplot(trace)
# plt.show()
# pm.summary(trace)
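If pymc3 is not available, the same update can be done analytically with SciPy alone: the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is Beta(alpha + heads, beta + tails). A minimal sketch:
from scipy.stats import beta

coin_data = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # 1 represents heads, 0 represents tails
heads = coin_data.sum()
tails = len(coin_data) - heads

# Beta(1, 1) prior (uniform), updated with the observed flips
posterior = beta(1 + heads, 1 + tails)
print(f"Posterior mean for p: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")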
Time Series Analysis:
Time series analysis deals with data collected over time. It involves identifying patterns, trends, and seasonality in the data.
Time Series Decomposition: Separating a time series into its components: trend, seasonality, and residuals.
Forecasting: Predicting future values based on past patterns. (A forecasting sketch follows the decomposition code below.)
# Sample time series data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
data = np.sin(2 * np.pi * np.arange(100) / 30) + np.random.normal(scale=0.5, size=100)
df_time_series = pd.DataFrame({'Date': dates, 'Value': data})

# Plot the time series
plt.plot(df_time_series['Date'], df_time_series['Value'])
plt.title('Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Decompose the time series (requires statsmodels)
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df_time_series['Value'], model='additive', period=30)
result.plot()
plt.show()
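Decomposition is shown above; for the forecasting side, one simple option among many is Holt-Winters exponential smoothing from statsmodels. A minimal sketch forecasting 30 days ahead:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit an additive trend + seasonality model and forecast the next 30 days
hw_model = ExponentialSmoothing(df_time_series['Value'],
                                trend='add', seasonal='add',
                                seasonal_periods=30).fit()
forecast = hw_model.forecast(30)
print(forecast.head())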
Machine Learning:
Machine learning uses algorithms to learn patterns from data and make predictions or decisions.
Classification: Assigning data points to categories (e.g., spam detection, image recognition).
Clustering: Grouping data points into clusters based on similarity (e.g., customer segmentation). (A clustering sketch follows the classification example below.)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample data for classification
X = np.random.rand(100, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
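Clustering follows the same scikit-learn pattern. A minimal sketch grouping the same 2-D points with KMeans (the choice of 3 clusters is arbitrary here):
from sklearn.cluster import KMeans

# Group the points into 3 clusters based on proximity
kmeans = KMeans(n_clusters=3, n_init=10)
labels = kmeans.fit_predict(X)
print(f"Cluster sizes: {np.bincount(labels)}")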
These advanced topics provide a glimpse into the broader world of statistics and probability. There are many other areas to explore, such as survival analysis, spatial statistics, and stochastic processes.
VI. Conclusion
Summary of Key Concepts:
This notebook covered fundamental concepts in statistics and probability, including:
Descriptive statistics: Measures of central tendency, dispersion, and correlation.
Probability: Basic concepts, probability distributions, conditional probability, and Bayes’ Theorem.
Inferential statistics: Sampling, estimation, hypothesis testing, and regression analysis.
Advanced topics: Bayesian statistics, time series analysis, and machine learning.