Bagging classifiers using sampler#
In this example, we show how BalancedBaggingClassifier can be used to create a large variety of classifiers by passing different samplers. We give several examples that have been published over the past years.
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)
Generate an imbalanced dataset#
For this example, we create a synthetic dataset using the function make_classification. The problem is a toy binary classification problem with a ratio of roughly 1:9 between the two classes.
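The code that builds the dataset does not appear on this page. A minimal sketch of how such a dataset could be generated is given here; the number of samples, the number of features, `class_sep` and the seed are assumptions chosen for illustration, not values confirmed by the text.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Assumed settings: n_samples, n_features, class_sep and random_state are
# illustrative choices and are not taken from the original page.
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],  # roughly 1:9 between class 0 and class 1
    class_sep=0.5,
    random_state=0,
)

# Class proportions; the default label noise (flip_y) keeps them from being
# exactly 0.1 and 0.9.
counts = Counter(y.tolist())
proportions = {label: count / len(y) for label, count in counts.items()}
print(proportions)
```

The proportions printed by this sketch need not match the figures reported for the original run digit for digit.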
1 0.8977
0 0.1023
Name: proportion, dtype: float64
In the following sections, we show a few algorithms that have been proposed over the years. We intend to illustrate how one can reuse the BalancedBaggingClassifier by passing different samplers. As a baseline, we first evaluate a plain BaggingClassifier without any resampling.
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate
# Baseline: plain bagging without any resampling
bagging = BaggingClassifier()
cv_results = cross_validate(bagging, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.712 +/- 0.012
Exactly Balanced Bagging and Over-Bagging#
The BalancedBaggingClassifier can be used in conjunction with a RandomUnderSampler or a RandomOverSampler. These methods are referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and were first proposed in [1].
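To make the difference between the two strategies concrete, the sketch below resamples a toy label vector with plain NumPy. It only mimics what random under-sampling and random over-sampling do to the class counts of a bag; it is not the imbalanced-learn implementation, and both helper functions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 90 majority samples (class 1) and 10 minority samples (class 0).
y_toy = np.array([1] * 90 + [0] * 10)
majority_idx = np.flatnonzero(y_toy == 1)
minority_idx = np.flatnonzero(y_toy == 0)


def under_sample(majority_idx, minority_idx, rng):
    """Hypothetical helper: shrink the majority class to the minority size."""
    kept = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    return np.concatenate([kept, minority_idx])


def over_sample(majority_idx, minority_idx, rng):
    """Hypothetical helper: grow the minority class to the majority size."""
    n_extra = majority_idx.size - minority_idx.size
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.concatenate([majority_idx, minority_idx, extra])


under_counts = np.bincount(y_toy[under_sample(majority_idx, minority_idx, rng)])
over_counts = np.bincount(y_toy[over_sample(majority_idx, minority_idx, rng)])
print(under_counts)  # 10 samples per class
print(over_counts)   # 90 samples per class
```

In this toy setting, under-sampling yields a small balanced bag of 20 samples, while over-sampling yields a larger balanced bag of 180 samples in which minority samples are repeated.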
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler
# Exactly Balanced Bagging
ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.751 +/- 0.011
from imblearn.over_sampling import RandomOverSampler
# Over-bagging
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.695 +/- 0.017
SMOTE-Bagging#
Instead of using a RandomOverSampler, which draws a bootstrap sample, an alternative is to use SMOTE as the over-sampler. This is known as SMOTE-Bagging [2].
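Unlike a bootstrap, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that single interpolation step with NumPy; the helper `smote_like_sample` and its parameter `k` are hypothetical, and the code is a simplified illustration rather than the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority class: five points in two dimensions.
X_minority = rng.normal(size=(5, 2))


def smote_like_sample(X_minority, rng, k=2):
    """Hypothetical helper: create one synthetic point by interpolation."""
    i = rng.integers(len(X_minority))
    # Distances from sample i to every minority sample.
    distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
    # Indices of the k nearest neighbours, skipping sample i itself.
    neighbours = np.argsort(distances)[1 : k + 1]
    j = rng.choice(neighbours)
    # Random position on the segment between sample i and neighbour j.
    gap = rng.uniform()
    return X_minority[i] + gap * (X_minority[j] - X_minority[i]), i, j


synthetic, i, j = smote_like_sample(X_minority, rng)
print(synthetic)
```

The synthetic point always lies on the segment joining the two selected minority samples, so it is a new point rather than a copy of an existing one.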
from imblearn.over_sampling import SMOTE
# SMOTE-Bagging
smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.741 +/- 0.015
Roughly Balanced Bagging#
While a RandomUnderSampler or a RandomOverSampler creates exactly the desired number of samples, this does not follow the statistical spirit of the bagging framework, where the composition of each bootstrap sample is itself random. The authors of [3] propose drawing the number of majority-class samples from a negative binomial distribution and then performing a random under-sampling with that number.
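The effect of the negative binomial draw can be checked numerically. The sketch below assumes a minority class of 100 samples (an illustrative value) and simulates the number of majority samples selected over many hypothetical bags.

```python
import numpy as np

rng = np.random.default_rng(0)

n_minority = 100  # assumed minority class size, for illustration only

# Number of majority samples drawn for each of 10,000 hypothetical bags.
# NumPy's negative binomial counts failures before `n` successes, so with
# p=0.5 its mean is n * (1 - p) / p = n and its variance is 2 * n.
n_majority_draws = rng.negative_binomial(n=n_minority, p=0.5, size=10_000)

print(n_majority_draws[:5])
print(n_majority_draws.mean())  # close to n_minority
print(n_majority_draws.std())   # close to sqrt(2 * n_minority)
```

On average each bag is balanced, but the majority count fluctuates from bag to bag, which is what "roughly balanced" refers to.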
Here, we illustrate this method by implementing a resampling function and wrapping it in a FunctionSampler, so that it can be used as the sampler of a BalancedBaggingClassifier and evaluated with cross_validate.
from collections import Counter
import numpy as np
from imblearn import FunctionSampler
def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for a binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)
    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)
    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])
    return X[indices], y[indices]
# Roughly Balanced Bagging
rbb = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.752 +/- 0.016