Introduction

Imbalanced datasets, where one class significantly outnumbers the others, can pose challenges for machine learning models. This blog post explores strategies to address such imbalances using Python and scikit-learn. We will work through a real-world example using the eBay Auctions dataset.

Loading and Exploring the Dataset

We begin by importing the necessary libraries and loading the dataset. The eBay Auctions dataset, available in the DMBA repository, contains one row per eBay auction, with a binary target column, Competitive?, marking whether the auction was competitive (i.e., attracted competing bids).

# Code for loading the dataset
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gdown
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Download and load the eBay Auctions dataset
url="https://github.com/gedeck/dmba/raw/master/datasets/dmba-datasets.zip"
gdown.download(url,'file.zip',quiet=True)
!unzip -o file.zip &> /dev/null
df=pd.read_csv('/content/dmba/eBayAuctions.csv')
df.head()

Next, we preprocess the data by one-hot encoding categorical variables and preparing the features (X) and labels (y) for model training.

# Code for data preprocessing
# One-hot encode every predictor column; the target 'Competitive?' (the last column) is kept as-is
df = pd.get_dummies(data=df, columns=df.columns[:-1])
df.head()

# Split the data into features (X) and labels (y)
Xtrain = df.drop('Competitive?', axis=1)
ytrain = df['Competitive?']

# Visualize class distribution
counts = ytrain.value_counts()
plt.pie(counts, labels=counts.index, autopct=lambda p: '{:.0f}'.format(p * sum(counts) / 100))
plt.show()

Handling Imbalanced Data

Imbalanced datasets can lead to models that are biased toward the majority class. To address this, we explore two common techniques: random under-sampling of the majority class and random over-sampling of the minority class.

Random Under-sampling

# Code for random under-sampling
rus = RandomUnderSampler(sampling_strategy=1)  # down-sample the majority class to a 1:1 class ratio
Xtrain_us, ytrain_us = rus.fit_resample(Xtrain, ytrain)
counts = ytrain_us.value_counts()
plt.pie(counts, labels=counts.index, autopct=lambda p: '{:.0f}'.format(p * sum(counts) / 100))
plt.title("Under-sampling");

Random Over-sampling

# Code for random over-sampling
ros = RandomOverSampler(sampling_strategy=1)  # duplicate minority-class rows up to a 1:1 class ratio
Xtrain_os, ytrain_os = ros.fit_resample(Xtrain, ytrain)
counts = ytrain_os.value_counts()
plt.pie(counts, labels=counts.index, autopct=lambda p: '{:.0f}'.format(p * sum(counts) / 100))
plt.title("Over-sampling");

Model Training and Evaluation

We train a Decision Tree Classifier using an imblearn pipeline that incorporates either under-sampling or over-sampling. Because the resampler sits inside the pipeline, it is applied only to the training portion of each cross-validation fold; the held-out folds are never resampled, so the scores are not inflated by leakage.

# Code for model training and evaluation
# Under-sampling pipeline: the sampler runs only when the pipeline is fit on a training fold
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
scores_us = cross_val_score(pipeline, Xtrain, ytrain, scoring='f1_micro', cv=10, n_jobs=-1)
score = np.mean(scores_us)
print('F1 Score (Under-sampling): %.3f' % score)

# Over-sampling pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
scores_os = cross_val_score(pipeline, Xtrain, ytrain, scoring='f1_micro', cv=10, n_jobs=-1)
score = np.mean(scores_os)
print('F1 Score (Over-sampling): %.3f' % score)

Assessing the Impact

To understand the impact of under-sampling versus over-sampling on model performance, we compare the two sets of fold-wise F1 scores with a paired t-test (scipy.stats.ttest_rel). Both sets of scores come from the same 10 cross-validation folds, so the observations are naturally paired.

# Code for a paired t-test on the fold-wise F1 scores
import scipy.stats as stats
t_statistic, p_value = stats.ttest_rel(scores_us, scores_os)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("The difference between the two samples is statistically significant.")
else:
    print("There is no significant difference between the two samples.")

Conclusion

In this guide, we explored techniques for handling imbalanced datasets using under-sampling and over-sampling. We applied these methods to a real-world dataset and evaluated their impact on model performance. The paired t-test provided a statistical assessment of the difference in F1 scores between the under-sampling and over-sampling approaches. Understanding and addressing imbalanced datasets is a crucial step towards building more robust and unbiased machine learning models.

Note

The following equations describe the process of conducting a paired-samples t-test (the test performed by ttest_rel above), where \(d\) represents the difference between paired observations. Here’s an explanation for each equation:

  1. Mean of Differences (\(\bar{d}\)): \[ \bar{d} = \frac{\sum d_i}{n} \] This equation calculates the average difference (\(\bar{d}\)) between paired observations. It involves summing up all the differences (\(d_i\)) and dividing by the number of pairs (\(n\)).

  2. Standard Deviation of Differences (\(s_d\)): \[ s_d = \sqrt{\frac{\sum\left(d_i-\bar{d}\right)^2}{n-1}} \] \(s_d\) represents the standard deviation of the differences. It is computed by taking the square root of the sum of squared deviations from the mean difference, divided by \(n-1\) (the sample standard deviation).

  3. T-statistic (\(t\)): \[ t = \frac{\bar{d}-\mu_d}{s_d / \sqrt{n}} \] The t-statistic measures how many standard errors the sample mean difference (\(\bar{d}\)) is from the hypothesized population mean difference (\(\mu_d\)); under the null hypothesis of no difference, \(\mu_d = 0\). It is calculated by dividing the difference between \(\bar{d}\) and \(\mu_d\) by the standard error of the mean difference (\(s_d / \sqrt{n}\)).

  4. Alternative T-statistic (\(t\)): \[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\text{var}(x_1 - x_2) / n}} \] Alternatively, you can compute the t-statistic directly from the two sample means (\(\bar{x}_1\) and \(\bar{x}_2\)) and the variance of the paired differences, \(\text{var}(x_1 - x_2)\), divided by \(n\): because \(\bar{d} = \bar{x}_1 - \bar{x}_2\) and \(s_d^2 = \text{var}(x_1 - x_2)\), this is the same quantity as in equation 3.

These equations are fundamental in hypothesis testing to determine whether the means of two paired samples are significantly different.
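To connect these formulas to the earlier code, here is a minimal sketch, assuming the scores_us and scores_os arrays from the cross-validation step above are still in scope, that recomputes the paired t-statistic by hand and checks it against scipy.stats.ttest_rel:

# Recompute the paired t-statistic by hand from the fold-wise F1 scores
import numpy as np
import scipy.stats as stats

d = scores_us - scores_os               # paired differences d_i (one per CV fold)
n = len(d)                              # number of pairs
d_bar = d.mean()                        # mean of the differences
s_d = d.std(ddof=1)                     # sample standard deviation of the differences
t_manual = d_bar / (s_d / np.sqrt(n))   # t = (d_bar - mu_d) / (s_d / sqrt(n)), with mu_d = 0

print(f"Manual t-statistic: {t_manual:.4f}")
print(f"scipy ttest_rel:    {stats.ttest_rel(scores_us, scores_os).statistic:.4f}")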

Once you have calculated the t-statistic, the next step is to determine whether it is significant. The significance of the t-statistic is typically assessed by comparing it to a critical value from the t-distribution or by calculating a p-value.

Using Critical Values:

  1. Determine Degrees of Freedom (\(df\)): For the paired t-test used here, the degrees of freedom are \(df = n - 1\), where \(n\) is the number of pairs (the 10 cross-validation folds, giving \(df = 9\)).

  2. Find Critical Value: Look up the critical value for your chosen significance level (e.g., 0.05) and degrees of freedom in the t-distribution table. The critical value corresponds to the point beyond which you would reject the null hypothesis.

  3. Compare t-statistic and Critical Value: If the absolute value of the t-statistic exceeds the critical value, you reject the null hypothesis; the further \(|t|\) lies beyond the critical value, the stronger the evidence against the null hypothesis. A short sketch of this comparison follows below.
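The following sketch, assuming the t_statistic from the ttest_rel call above and the 10-fold scores_us array, looks up the two-tailed critical value with scipy and applies the comparison:

# Critical-value approach: compare |t| to the two-tailed critical value
from scipy import stats

alpha = 0.05
df_pairs = len(scores_us) - 1                    # n - 1 pairs (10 CV folds -> df = 9)
t_crit = stats.t.ppf(1 - alpha / 2, df_pairs)    # two-tailed critical value

print(f"Critical value (alpha={alpha}, df={df_pairs}): {t_crit:.3f}")
if abs(t_statistic) > t_crit:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")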

Using P-Value:

  1. Calculate P-Value: Use the t-statistic and the degrees of freedom to calculate the p-value. The p-value represents the probability of observing a t-statistic as extreme as, or more extreme than, the one obtained, assuming the null hypothesis is true.

  2. Compare P-Value and Significance Level: Compare the p-value to your chosen significance level (e.g., 0.05). If the p-value is less than or equal to the significance level, you reject the null hypothesis, as in the sketch below. A smaller p-value indicates stronger evidence against the null hypothesis.
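As a counterpart to the critical-value sketch, here is a minimal example, again assuming t_statistic and scores_us from above, that recomputes the two-tailed p-value from the t-distribution and compares it to the significance level; it should match the p_value reported by ttest_rel:

# P-value approach: two-tailed p-value from the t-distribution
from scipy import stats

alpha = 0.05
df_pairs = len(scores_us) - 1                            # n - 1 pairs
p_manual = 2 * stats.t.sf(abs(t_statistic), df_pairs)    # two-tailed p-value

print(f"P-value: {p_manual:.4f}")
if p_manual <= alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")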

Decision:

  • Reject Null Hypothesis:
    • If \(|t|\) > Critical Value (or) P-Value \(\leq\) Significance Level
    • Conclude that there is sufficient evidence to reject the null hypothesis.
  • Fail to Reject Null Hypothesis:
    • If \(|t|\) \(\leq\) Critical Value (or) P-Value > Significance Level
    • Conclude that there is not enough evidence to reject the null hypothesis.

In summary, the decision to reject or fail to reject the null hypothesis is based on comparing the t-statistic to critical values or using p-values, depending on the chosen approach.