Introduction

Data binning groups continuous values into a small number of intervals, which smooths out noise and makes a distribution easier to summarize. In this walkthrough, we’ll dissect a Python script that uses NumPy and Pandas to implement two types of binning: equal-width and equal-depth (also called equal-frequency).


Generating Random Data

Let’s start by generating a random dataset using NumPy:

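import numpy as np
import pandas as pd  # used later to tabulate the bins

# 20 random integers from 1 to 1199 (the upper bound `high` is exclusive)
data = np.random.randint(low=1, high=1200, size=20)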

This dataset will be used to demonstrate the details of both equal-width and equal-depth binning techniques.
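Because the data is random, the bins change on every run. If you want results you can reproduce, one option is a seeded generator. The sketch below is an assumption, not part of the original script: it swaps in NumPy's `default_rng` with an arbitrary seed of 42.

```python
import numpy as np

# Assumption: a fixed seed (42 is arbitrary) so every run yields the same values
rng = np.random.default_rng(seed=42)

# Same range as the original call: integers from 1 to 1199 (high is exclusive)
data = rng.integers(low=1, high=1200, size=20)

print(data)
```

Any integer seed works; the point is only that the same seed gives the same dataset.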

Sorting the Data

Before we delve into binning, let’s highlight the significance of sorting the dataset:

print(data)
print(np.sort(data))

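Sorting is not strictly required for equal-width binning, which only needs the minimum and maximum. Equal-depth binning does depend on it, because its bin boundaries are read directly from the sorted values. Printing the sorted array also gives a clearer view of how the values are distributed.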

Equal-Width Binning

Equal-width binning entails dividing the range of the dataset into intervals of equal width. Here’s a detailed breakdown of the steps involved:

# Calculate Bin Boundaries
num_bins_width = 5
bin_boundaries_width = np.linspace(min(data), max(data), num_bins_width + 1, endpoint=True)

# Digitize Data
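# With right=False, max(data) equals the last edge and gets index num_bins_width + 1,
# so clip it back into the last bin instead of silently dropping it
bin_indices = np.clip(np.digitize(data, bin_boundaries_width, right=False), 1, num_bins_width)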

# Initialize Bins List
equal_width_bins = []
for i in range(1, num_bins_width + 1):
    if i not in bin_indices:
        # Empty bin: use placeholders so the bin still appears in the table
        bin_values = []
        bin_mean = np.nan
        bin_median = np.nan
        bin_boundary = []
    else:
        bin_values = np.sort(data[bin_indices == i])
        bin_mean = np.round(np.mean(bin_values), 2)
        bin_median = np.median(bin_values)
        # Smoothing by bin boundaries: snap each value to the nearer of the bin's min and max
        bin_boundary = np.where(bin_values - np.min(bin_values) < np.max(bin_values) - bin_values,
                                np.min(bin_values), np.max(bin_values))
    # Append outside the if/else so that empty bins are recorded too
    equal_width_bins.append({
        'Bin': i,
        'Interval': (bin_boundaries_width[i - 1], bin_boundaries_width[i]),
        'Data_Values': bin_values,
        'Bin_Mean': bin_mean,
        'Bin_Median': bin_median,
        'Bin_Boundary_Smoothing': bin_boundary
    })

df_equal_width_bins = pd.DataFrame(equal_width_bins)

df_equal_width_bins
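The Bin_Mean, Bin_Median, and Bin_Boundary_Smoothing columns correspond to three common ways of smoothing the values inside a bin. As a minimal sketch, here is what each one does to a single hand-picked bin. The values in `bin_values` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical values that landed in one bin, already sorted
bin_values = np.array([4, 8, 9, 15])

# Smoothing by mean / median: every value is replaced by the bin's mean / median
smoothed_by_mean = np.full(bin_values.shape, np.mean(bin_values))
smoothed_by_median = np.full(bin_values.shape, np.median(bin_values))

# Smoothing by boundaries: each value snaps to the nearer of the bin's min and max
# (with the strict "<" used here, a value exactly halfway goes to the max)
lo, hi = bin_values.min(), bin_values.max()
smoothed_by_boundary = np.where(bin_values - lo < hi - bin_values, lo, hi)

print(smoothed_by_mean)      # [9. 9. 9. 9.]
print(smoothed_by_median)    # [8.5 8.5 8.5 8.5]
print(smoothed_by_boundary)  # [ 4  4  4 15]
```

Note that 9 is 5 away from the minimum and 6 away from the maximum, so it snaps down to 4.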

Equal-Depth Binning

Equal-depth binning aims to create intervals with an equal number of data points. Here’s the corresponding code:

# Determine Bin Boundaries
num_bins_depth = 5
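# Floor division keeps every index below len(data); rounding up can overrun the array
# when len(data) is not a multiple of num_bins_depth (e.g. 8 values in 5 bins)
bin_size_depth = len(data) // num_bins_depth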
sorted_data = np.sort(data)
bin_boundaries_depth = [sorted_data[i * bin_size_depth] for i in range(num_bins_depth)] + [max(data)]

# Digitize Data
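# With right=False, max(data) equals the last edge and gets index num_bins_depth + 1,
# so clip it back into the last bin instead of silently dropping it
bin_indices = np.clip(np.digitize(data, bin_boundaries_depth, right=False), 1, num_bins_depth)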

# Initialize Bins List
equal_depth_bins = []
for i in range(1, num_bins_depth + 1):
    if i not in bin_indices:
        # Empty bin: use placeholders so the bin still appears in the table
        bin_values = []
        bin_mean = np.nan
        bin_median = np.nan
        bin_boundary = []
    else:
        bin_values = np.sort(data[bin_indices == i])
        bin_mean = np.round(np.mean(bin_values), 2)
        bin_median = np.median(bin_values)
        # Smoothing by bin boundaries: snap each value to the nearer of the bin's min and max
        bin_boundary = np.where(bin_values - np.min(bin_values) < np.max(bin_values) - bin_values,
                                np.min(bin_values), np.max(bin_values))
    # Append outside the if/else so that empty bins are recorded too
    equal_depth_bins.append({
        'Bin': i,
        'Interval': (bin_boundaries_depth[i - 1], bin_boundaries_depth[i]),
        'Data_Values': bin_values,
        'Bin_Mean': bin_mean,
        'Bin_Median': bin_median,
        'Bin_Boundary_Smoothing': bin_boundary
    })

df_equal_depth_bins = pd.DataFrame(equal_depth_bins)

df_equal_depth_bins
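As a cross-check on the hand-rolled approach, Pandas offers `pd.qcut`, which bins by quantiles. The sketch below is not the article's script: the `values` array is a hypothetical dataset of distinct numbers, and `pd.qcut` places its edges at interpolated quantiles rather than at actual data points, so its intervals will generally differ from the ones computed above even when the counts agree.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 12 distinct values, so 3 bins should hold 4 values each
values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# pd.qcut chooses edges at the 1/3 and 2/3 quantiles
labels = pd.qcut(values, q=3)

# Count how many values fall into each interval, in interval order
counts = pd.Series(labels).value_counts().sort_index()
print(counts)
```

With heavily duplicated values, `pd.qcut` can raise an error about non-unique bin edges; passing `duplicates='drop'` merges the offending bins at the cost of uneven counts.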

Summary

The fundamental difference between equal-width and equal-depth binning lies in how the intervals are defined. Equal-width ensures consistent data value ranges, while equal-depth maintains a consistent number of data points per interval.
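The contrast is easiest to see on skewed data. The following sketch uses a hypothetical array with one large outlier, and leans on `np.histogram` and `np.array_split` as shortcuts rather than the loops shown earlier.

```python
import numpy as np

# Hypothetical skewed dataset: nine small values and one outlier
skewed = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 3 intervals of width 33 spanning [1, 100]
width_counts, width_edges = np.histogram(skewed, bins=3)

# Equal-depth: split the sorted values into 3 groups of near-equal size
depth_groups = np.array_split(np.sort(skewed), 3)
depth_counts = [len(group) for group in depth_groups]

print(width_edges)   # [  1.  34.  67. 100.]
print(width_counts)  # [9 0 1]
print(depth_counts)  # [4, 3, 3]
```

Equal-width leaves the middle bin empty and crowds nine values into the first, while equal-depth spreads the values evenly but produces a last interval that stretches from 8 to 100.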

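To see the difference, we can plot both sets of bins with Matplotlib, Python’s most widely used plotting library. Each histogram below uses the bin boundaries computed earlier (bin_boundaries_width and bin_boundaries_depth) as its bin edges. In the equal-width plot the bars share one width and differ in height; in the equal-depth plot the bars hold roughly the same number of points and differ in width.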

import matplotlib.pyplot as plt


# Plot histograms
plt.figure(figsize=(12, 6))

# Equal-width binning
plt.subplot(1, 2, 1)
plt.hist(data, bins=bin_boundaries_width, edgecolor='black')
plt.title('Equal-Width Binning')
plt.xlabel('Data Values')
plt.ylabel('Frequency')
# Mark each bin edge with a dashed red line
for edge in bin_boundaries_width:
    plt.axvline(edge, color='r', linestyle='dashed', linewidth=1)
plt.tight_layout()

# Equal-depth binning
plt.subplot(1, 2, 2)
plt.hist(data, bins=bin_boundaries_depth, edgecolor='black')
plt.title('Equal-Depth Binning')
plt.xlabel('Data Values')
plt.ylabel('Frequency')
# Mark each bin edge with a dashed red line
for edge in bin_boundaries_depth:
    plt.axvline(edge, color='r', linestyle='dashed', linewidth=1)
plt.tight_layout()

plt.show()