Bagging classifiers using sampler#

In this example, we show how BalancedBaggingClassifier can be used to create a large variety of classifiers by giving different samplers.

We will give several examples that have been published over the past years.

# Authors: Guillaume Lemaitre
# License: MIT

Generate an imbalanced dataset#

For this example, we will create a synthetic dataset using the function make_classification. The problem will be a toy classification problem with a ratio of 1:9 between the two classes.

from sklearn.datasets import make_classification

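X, y = make_classification(
    # the argument list is truncated in the source; n_samples is an assumption
    # consistent with the four-decimal class proportions shown below
    n_samples=10_000,
    weights=[0.1, 0.9],
)

import pandas as pd

# proportion of samples in each class
pd.Series(y).value_counts(normalize=True)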

1    0.8977
0    0.1023
dtype: float64

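In the following sections, we will show a couple of algorithms that have been proposed over the years. We intend to illustrate how one can reuse the BalancedBaggingClassifier by passing different samplers. As a baseline, we first evaluate a plain BaggingClassifier from scikit-learn, which applies no class rebalancing.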

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate

# Baseline: bagging without any resampling of the classes
bagging = BaggingClassifier()
cv_results = cross_validate(bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.710 +/- 0.012
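
The scores in this example use scoring="balanced_accuracy". As a minimal sketch, assuming the usual definition of balanced accuracy as the mean of the per-class recalls, the following shows why plain accuracy is misleading on a 1:9 problem. The helper name balanced_accuracy and the toy labels are hypothetical and only serve the illustration; they are not part of the scikit-learn API.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (hypothetical helper for illustration)."""
    n_per_class = Counter(y_true)
    n_correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [n_correct[c] / n_per_class[c] for c in n_per_class]
    return sum(recalls) / len(recalls)


# Toy labels with a 1:9 ratio and a classifier that always predicts class 1.
y_true = [0] * 10 + [1] * 90
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")
```

The constant classifier reaches an accuracy of 0.90 but a balanced accuracy of only 0.50, because its recall on the minority class is zero.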

Exactly Balanced Bagging and Over-Bagging#

The BalancedBaggingClassifier can be used in conjunction with a RandomUnderSampler or a RandomOverSampler. These methods are referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and were first proposed in [1].

from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler

# Exactly Balanced Bagging
ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.750 +/- 0.015

from imblearn.over_sampling import RandomOverSampler

# Over-bagging
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.711 +/- 0.014
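
To make the difference between the two strategies concrete, here is a minimal NumPy sketch of what random under-sampling and random over-sampling do to the class counts of one bag. It is an illustration under simplifying assumptions (a toy label vector and a hypothetical helper named resample_indices), not the imblearn implementation.

```python
import numpy as np


def resample_indices(y, strategy, rng):
    """Return indices giving every class the same count (illustrative helper)."""
    classes, counts = np.unique(y, return_counts=True)
    # under-sampling shrinks every class to the minority size,
    # over-sampling grows every class to the majority size
    target = int(counts.min() if strategy == "under" else counts.max())
    selected = []
    for label, count in zip(classes, counts):
        candidates = np.flatnonzero(y == label)
        # drawing with replacement is only needed when a class must grow
        replace = bool(count < target)
        selected.append(rng.choice(candidates, size=target, replace=replace))
    return np.concatenate(selected)


rng = np.random.default_rng(0)
y_toy = np.array([0] * 10 + [1] * 90)

under = resample_indices(y_toy, "under", rng)
over = resample_indices(y_toy, "over", rng)
print(np.bincount(y_toy[under]))  # both classes reduced to 10 samples
print(np.bincount(y_toy[over]))  # both classes raised to 90 samples
```

Under-sampling discards majority samples in each bag, while over-sampling repeats minority samples, which is one reason the two variants can score differently.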


SMOTE-Bagging#

Instead of using a RandomOverSampler, which draws a bootstrap sample, an alternative is to use SMOTE as the over-sampler. This is known as SMOTE-Bagging [2].

from imblearn.over_sampling import SMOTE

# SMOTE-Bagging
smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.746 +/- 0.010
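
SMOTE creates new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only this core step with NumPy. The function name smote_like_samples, its parameters, and the toy data are hypothetical, and the SMOTE class in imblearn covers more cases than this simplified version.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k_neighbors, rng):
    """Interpolate between minority samples and their nearest neighbours."""
    # pairwise Euclidean distances between minority samples
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    distances = np.sqrt((diff**2).sum(axis=-1))
    # column 0 of the argsort is the sample itself (assuming no duplicated
    # points), so the neighbours start at column 1
    neighbors = np.argsort(distances, axis=1)[:, 1 : k_neighbors + 1]

    # pick a base sample and one of its neighbours for each new point
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbors[base, rng.integers(0, k_neighbors, size=n_new)]
    # random position on the segment joining the sample and its neighbour
    gap = rng.uniform(0, 1, size=(n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # toy minority class
X_new = smote_like_samples(X_min, n_new=50, k_neighbors=5, rng=rng)
print(X_new.shape)
```

Unlike a bootstrap, every generated point is new, but it always lies on a segment between two existing minority samples.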

Roughly Balanced Bagging#

While a RandomUnderSampler or RandomOverSampler creates exactly the desired number of samples, it does not follow the statistical spirit of the bagging framework. The authors of [3] propose using a negative binomial distribution to compute the number of majority-class samples to select, and then performing random under-sampling.

Here, we illustrate this method by implementing a function in charge of the resampling and using the FunctionSampler to integrate it within a BalancedBaggingClassifier and cross_validate.

from collections import Counter

import numpy as np

from imblearn import FunctionSampler

def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)

    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])

    return X[indices], y[indices]

# Roughly Balanced Bagging
rbb = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.759 +/- 0.011
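
The choice of p=0.5 in the function above can be checked empirically. NumPy's negative binomial counts the failures observed before n successes, with mean n(1 - p) / p, which equals n when p=0.5. The sketch below draws the majority-class size for many bags; the minority size of 1,000 is a hypothetical value chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_minority = 1_000  # hypothetical minority-class size

# one draw per bag: number of majority samples to select in that bag
n_majority_draws = rng.negative_binomial(n=n_minority, p=0.5, size=10_000)

print(f"mean: {n_majority_draws.mean():.1f}")
print(f"std:  {n_majority_draws.std():.1f}")
print(f"min/max: {n_majority_draws.min()} / {n_majority_draws.max()}")
```

The bags are therefore balanced on average, while the class ratio of each individual bag fluctuates around 1:1, which is the "roughly balanced" behaviour the method is named after.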

[1] R. Maclin and D. Opitz. “An empirical evaluation of bagging and boosting.” AAAI/IAAI 1997 (1997): 546-551.

[2] S. Wang and X. Yao. “Diversity analysis on imbalanced data sets by using ensemble models.” 2009 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2009.

[3] S. Hido, H. Kashima, and Y. Takahashi. “Roughly balanced bagging for imbalanced data.” Statistical Analysis and Data Mining: The ASA Data Science Journal 2.5-6 (2009): 412-426.
