Statistical Thinking in Python

Statistical thinking is fundamental to machine learning and AI. Since Python is the language of choice for these technologies, we will explore how to write Python programs that incorporate statistical analysis. In this article, we will use pandas, NumPy, Matplotlib, and seaborn to create charts that help us analyze data quickly and derive insights through visualization.

Data Preparation

We'll use a dataset containing information about various seeds. This dataset is available on Kaggle and has eight columns that we'll use to create different types of charts for comparing seed features. The program below loads the dataset and displays sample rows.

Example

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load the seeds dataset
# Dataset available at: https://www.kaggle.com/jmcaro/wheat-seedsuci
datainput = pd.read_csv('seeds.csv')
print(datainput.head())

Running the above code gives us the following result:

    Area  Perimeter  Compactness  Length_of_Kernel  Width_of_Kernel  \
0  15.26      14.84       0.8710             5.763            3.312   
1  14.88      14.57       0.8811             5.554            3.333   
2  14.29      14.09       0.9050             5.291            3.337   
3  13.84      13.94       0.8955             5.324            3.379   
4  16.14      14.99       0.9034             5.658            3.562   

   Asymmetry_Coeff  Kernel_Groove  Type  
0            2.221          5.220     1  
1            1.018          4.956     1  
2            2.699          4.825     1  
3            2.259          4.805     1  
4            1.355          5.175     1  
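
Before plotting, it can help to look at basic summary statistics. The sketch below rebuilds only the five sample rows printed above, with three of the measurement columns, so that it runs without the CSV file. With the real file, the same calls would be applied to datainput instead. The name sample_rows is introduced here purely for illustration.

```python
import pandas as pd

# Five sample rows copied from the printed output (illustrative subset only)
sample_rows = pd.DataFrame({
    'Area': [15.26, 14.88, 14.29, 13.84, 16.14],
    'Perimeter': [14.84, 14.57, 14.09, 13.94, 14.99],
    'Length_of_Kernel': [5.763, 5.554, 5.291, 5.324, 5.658],
    'Type': [1, 1, 1, 1, 1],
})

# Count, mean, standard deviation, min, quartiles, and max per measurement column
summary = sample_rows.drop(columns='Type').describe()
print(summary)

# Number of rows for each seed type
type_counts = sample_rows['Type'].value_counts()
print(type_counts)
```

Five rows are far too few to draw conclusions from. The point is only to show which calls give a first numeric overview before any chart is drawn.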

Creating a Histogram

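A histogram shows the distribution of a continuous variable. We'll create a histogram for the kernel length feature to understand its distribution pattern. The example below uses synthetic, normally distributed values as a stand-in for the Length_of_Kernel column, so it runs without the CSV file.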

Example

import matplotlib.pyplot as plt
import numpy as np

# Create sample data similar to seeds dataset
np.random.seed(42)
kernel_length = np.random.normal(5.5, 0.5, 200)

# Choose the number of bins with the square-root rule (a common heuristic)
n_points = len(kernel_length)
bins = int(np.sqrt(n_points))

# Create histogram
plt.figure(figsize=(8, 6))
plt.hist(kernel_length, bins=bins, color='#FF4040', alpha=0.7, edgecolor='black')
plt.xlabel('Kernel Length')
plt.ylabel('Frequency')
plt.title('Distribution of Kernel Length')
plt.grid(True, alpha=0.3)
plt.show()
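
The square-root rule is only one of several heuristics for choosing the number of bins. As a rough sketch, the snippet below compares the bin counts suggested by three of NumPy's built-in rules on the same synthetic data. The dictionary name bin_counts is an illustrative choice, and no rule is the correct one for every dataset.

```python
import numpy as np

# Same synthetic stand-in data as in the histogram example
np.random.seed(42)
kernel_length = np.random.normal(5.5, 0.5, 200)

# Compare the number of bins suggested by several built-in NumPy rules
bin_counts = {}
for rule in ['sqrt', 'sturges', 'fd']:
    edges = np.histogram_bin_edges(kernel_length, bins=rule)
    bin_counts[rule] = len(edges) - 1  # number of bins = number of edges - 1
    print(f"{rule}: {bin_counts[rule]} bins")
```

The same rule names can be passed directly as the bins argument of plt.hist, which makes it easy to redraw the histogram under each rule and check whether the apparent shape depends on the binning.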

Empirical Cumulative Distribution Function (ECDF)

An ECDF shows the proportion of data points less than or equal to each value. It provides a complete picture of the distribution without binning.

Example

import matplotlib.pyplot as plt
import numpy as np

def ecdf(data):
    """Calculate empirical cumulative distribution function"""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Generate sample kernel groove data
np.random.seed(42)
kernel_groove = np.random.normal(5.0, 0.3, 200)

# Calculate ECDF
x, y = ecdf(kernel_groove)

# Create ECDF plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, marker='.', linestyle='none', markersize=8, color='blue')
plt.xlabel('Kernel Groove')
plt.ylabel('ECDF')
plt.title('Empirical Cumulative Distribution Function')
plt.grid(True, alpha=0.3)
plt.show()
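
Percentiles can be read off an ECDF: the p-th percentile is the value at which the curve reaches p/100. The following sketch, using the same synthetic data, computes the quartiles with np.percentile and checks what fraction of the observations falls at or below each one. The variable names are illustrative.

```python
import numpy as np

# Same synthetic stand-in data as in the ECDF example
np.random.seed(42)
kernel_groove = np.random.normal(5.0, 0.3, 200)

# 25th, 50th, and 75th percentiles (the quartiles)
percentiles = np.array([25, 50, 75])
quartiles = np.percentile(kernel_groove, percentiles)

# Fraction of observations at or below each quartile: the ECDF evaluated there
ecdf_at_quartiles = np.array([np.mean(kernel_groove <= q) for q in quartiles])

for p, q, f in zip(percentiles, quartiles, ecdf_at_quartiles):
    print(f"{p}th percentile: {q:.3f} (ECDF = {f:.3f})")
```

To show these values on the plot, the quartiles could be drawn as extra markers on top of the ECDF points.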

Bee Swarm Plots

A bee swarm plot shows individual data points while avoiding overlap, making it easy to see the distribution and density of data points across categories.

Example

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create sample data similar to seeds dataset
np.random.seed(42)
n_samples = 60
data = pd.DataFrame({
    'Type': np.repeat([1, 2, 3], n_samples // 3),
    'Asymmetry_Coeff': np.concatenate([
        np.random.normal(2.5, 0.5, n_samples // 3),
        np.random.normal(3.0, 0.4, n_samples // 3),
        np.random.normal(4.2, 0.6, n_samples // 3)
    ])
})

# Create bee swarm plot
plt.figure(figsize=(8, 6))
sns.swarmplot(x='Type', y='Asymmetry_Coeff', data=data, color='#458B00', size=6)
plt.xlabel('Seed Type')
plt.ylabel('Asymmetry Coefficient')
plt.title('Distribution of Asymmetry Coefficient by Seed Type')
plt.show()
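
A swarm plot is easier to interpret next to a few numbers. As a minimal sketch, the snippet below rebuilds the same synthetic data and summarizes the asymmetry coefficient for each seed type with a pandas groupby. The name group_summary is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Same synthetic stand-in data as in the swarm plot example
np.random.seed(42)
n_samples = 60
data = pd.DataFrame({
    'Type': np.repeat([1, 2, 3], n_samples // 3),
    'Asymmetry_Coeff': np.concatenate([
        np.random.normal(2.5, 0.5, n_samples // 3),
        np.random.normal(3.0, 0.4, n_samples // 3),
        np.random.normal(4.2, 0.6, n_samples // 3)
    ])
})

# Per-type count, mean, median, and standard deviation
group_summary = data.groupby('Type')['Asymmetry_Coeff'].agg(
    ['count', 'mean', 'median', 'std']
)
print(group_summary.round(3))
```

Comparing the group means and spreads with the point clouds in the plot is a quick way to check that a visual difference between types is also reflected in the numbers.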

Key Benefits of Statistical Visualizations

Visualization Type   Best For                       Key Insight
Histogram            Single variable distribution   Shape and spread of data
ECDF                 Cumulative probabilities       Percentiles and quartiles
Bee Swarm Plot       Categorical comparisons        Individual points and density

Conclusion

Statistical thinking in Python involves using appropriate visualizations to understand data patterns. Histograms reveal distributions, ECDFs show cumulative probabilities, and bee swarm plots compare groups while preserving individual data points. These tools form the foundation for data-driven decision making in machine learning and AI.

Updated on: 2026-03-15T17:31:09+05:30
