Statistical Thinking in Python
Statistical thinking is fundamental to machine learning and AI. Because Python is widely used in both fields, this article shows how to write Python programs that incorporate statistical analysis. We will build graphs and charts with several Python modules to analyze data quickly and draw insights from visualization.
Data Preparation
We'll use a dataset containing information about various seeds. This dataset is available on Kaggle and has eight columns that we'll use to create different types of charts for comparing seed features. The program below loads the dataset and displays sample rows.
Example
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# Load the seeds dataset
datainput = pd.read_csv('seeds.csv')
# Dataset available at: https://www.kaggle.com/jmcaro/wheat-seedsuci
print(datainput.head())
Running the above code gives us the following result:
Area Perimeter Compactness Length_of_Kernel Width_of_Kernel \
0 15.26 14.84 0.8710 5.763 3.312
1 14.88 14.57 0.8811 5.554 3.333
2 14.29 14.09 0.9050 5.291 3.337
3 13.84 13.94 0.8955 5.324 3.379
4 16.14 14.99 0.9034 5.658 3.562
Asymmetry_Coeff Kernel_Groove Type
0 2.221 5.220 1
1 1.018 4.956 1
2 2.699 4.825 1
3 2.259 4.805 1
4 1.355 5.175 1
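Before plotting, a quick numerical summary can help confirm that the columns loaded as expected. The sketch below is an illustration, not part of the original program: it rebuilds three columns of the five sample rows shown above by hand, so it runs without seeds.csv, and then calls describe() and value_counts(). The variable names sample, summary, and type_counts are assumptions made for this example.

```python
import pandas as pd

# Hand-copied from the five sample rows printed above (three columns only),
# so this sketch runs without the seeds.csv file.
sample = pd.DataFrame({
    'Area': [15.26, 14.88, 14.29, 13.84, 16.14],
    'Perimeter': [14.84, 14.57, 14.09, 13.94, 14.99],
    'Type': [1, 1, 1, 1, 1],
})

# Count, mean, standard deviation, min, quartiles and max per numeric column
summary = sample.describe()
print(summary)

# Number of rows per seed type
type_counts = sample['Type'].value_counts()
print(type_counts)
```

With the full dataset, the same two calls on datainput would likely be the first thing to check before choosing which columns to plot.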
Creating a Histogram
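A histogram shows the distribution of a continuous variable. We'll create a histogram for the kernel length feature to understand its distribution pattern. To keep the example self-contained, the code generates normally distributed sample values in place of the Length_of_Kernel column.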
Example
import matplotlib.pyplot as plt
import numpy as np
# Create synthetic sample data that mimics the kernel length column
np.random.seed(42)
kernel_length = np.random.normal(5.5, 0.5, 200)
# Choose the number of bins with the square-root rule (a rule of thumb)
n_points = len(kernel_length)
bins = int(np.sqrt(n_points))
# Create histogram
plt.figure(figsize=(8, 6))
plt.hist(kernel_length, bins=bins, color='#FF4040', alpha=0.7, edgecolor='black')
plt.xlabel('Kernel Length')
plt.ylabel('Frequency')
plt.title('Distribution of Kernel Length')
plt.grid(True, alpha=0.3)
plt.show()
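To see the numbers behind the bars, np.histogram returns the bin counts and bin edges without drawing anything. The sketch below reuses the synthetic kernel_length sample and the square-root rule from the example above; the names counts and edges are chosen here for illustration.

```python
import numpy as np

# Same synthetic sample as in the histogram example (not the real seeds data)
np.random.seed(42)
kernel_length = np.random.normal(5.5, 0.5, 200)

# Square-root rule: number of bins = floor(sqrt(number of observations))
bins = int(np.sqrt(len(kernel_length)))

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(kernel_length, bins=bins)

print('Number of bins:', bins)
print('Bin width:', edges[1] - edges[0])
print('Counts per bin:', counts)
```

The square-root rule is only one option. Passing a string such as bins='sturges' or bins='fd' to np.histogram asks NumPy to pick the bin count with that estimator instead.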
Empirical Cumulative Distribution Function (ECDF)
An ECDF shows the proportion of data points less than or equal to each value. It provides a complete picture of the distribution without binning.
Example
import matplotlib.pyplot as plt
import numpy as np
def ecdf(data):
    """Calculate empirical cumulative distribution function"""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Generate sample kernel groove data
np.random.seed(42)
kernel_groove = np.random.normal(5.0, 0.3, 200)
# Calculate ECDF
x, y = ecdf(kernel_groove)
# Create ECDF plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, marker='.', linestyle='none', markersize=8, color='blue')
plt.xlabel('Kernel Groove')
plt.ylabel('ECDF')
plt.title('Empirical Cumulative Distribution Function')
plt.grid(True, alpha=0.3)
plt.show()
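Because the ECDF maps each value to the fraction of observations at or below it, a percentile can be read off by finding where the curve first reaches a given height. The sketch below shows this on a tiny hand-made array. The helper value_at is hypothetical: it is not part of NumPy or of the original example.

```python
import numpy as np

def ecdf(data):
    """Return sorted values and their cumulative proportions."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

def value_at(x, y, p):
    """Hypothetical helper: smallest value whose ECDF is at least p."""
    idx = np.searchsorted(y, p, side='left')
    return x[idx]

# Four observations, so the ECDF rises in steps of 1/4
x, y = ecdf(np.array([3.0, 1.0, 2.0, 4.0]))
print(x)
print(y)

# Half of the observations are less than or equal to this value
print(value_at(x, y, 0.5))  # 2.0
```

Note that np.percentile interpolates between neighboring values by default, so it can return a slightly different number (2.5 for this array) than the step-based lookup above.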
Bee Swarm Plots
A bee swarm plot shows individual data points while avoiding overlap, making it easy to see the distribution and density of data points across categories.
Example
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Create sample data similar to seeds dataset
np.random.seed(42)
n_samples = 60
data = pd.DataFrame({
    'Type': np.repeat([1, 2, 3], n_samples // 3),
    'Asymmetry_Coeff': np.concatenate([
        np.random.normal(2.5, 0.5, n_samples // 3),
        np.random.normal(3.0, 0.4, n_samples // 3),
        np.random.normal(4.2, 0.6, n_samples // 3)
    ])
})
# Create bee swarm plot
plt.figure(figsize=(8, 6))
sns.swarmplot(x='Type', y='Asymmetry_Coeff', data=data, color='#458B00', size=6)
plt.xlabel('Seed Type')
plt.ylabel('Asymmetry Coefficient')
plt.title('Distribution of Asymmetry Coefficient by Seed Type')
plt.show()
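A per-group numerical summary is a useful companion to the swarm plot. The sketch below rebuilds the same synthetic DataFrame as the example above and uses groupby to compute the count, mean, and standard deviation of the asymmetry coefficient for each seed type. The variable name group_stats is an assumption for this illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the swarm plot example (not the real seeds data)
np.random.seed(42)
n_samples = 60
data = pd.DataFrame({
    'Type': np.repeat([1, 2, 3], n_samples // 3),
    'Asymmetry_Coeff': np.concatenate([
        np.random.normal(2.5, 0.5, n_samples // 3),
        np.random.normal(3.0, 0.4, n_samples // 3),
        np.random.normal(4.2, 0.6, n_samples // 3)
    ])
})

# One row per seed type: number of points, mean and standard deviation
group_stats = data.groupby('Type')['Asymmetry_Coeff'].agg(['count', 'mean', 'std'])
print(group_stats)
```

The means locate the center of each swarm and the standard deviations describe its vertical spread, which makes the visual comparison easier to quantify.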
Key Benefits of Statistical Visualizations
| Visualization Type | Best For | Key Insight |
|---|---|---|
| Histogram | Single variable distribution | Shape and spread of data |
| ECDF | Cumulative probabilities | Percentiles and quartiles |
| Bee Swarm Plot | Categorical comparisons | Individual points and density |
Conclusion
Statistical thinking in Python involves using appropriate visualizations to understand data patterns. Histograms reveal distributions, ECDFs show cumulative probabilities, and bee swarm plots compare groups while preserving individual data points. These tools form the foundation for data-driven decision making in machine learning and AI.
