# Bagging classifiers using sampler

In this example, we show how `BalancedBaggingClassifier` can be used to create a large variety of classifiers by passing different samplers.

We will give several examples that have been published over the past years.

```
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>

print(__doc__)
```

## Generate an imbalanced dataset

For this example, we create a synthetic dataset using the function `make_classification`. It is a toy binary classification problem with a 1:9 ratio between the two classes.

```
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],
    class_sep=0.5,
    random_state=0,
)
```
```import pandas as pd

pd.Series(y).value_counts(normalize=True)
```
```1    0.8977
0    0.1023
Name: proportion, dtype: float64
```

In the following sections, we will show a couple of algorithms that have been proposed over the years. We intend to illustrate how one can reuse the `BalancedBaggingClassifier` by passing different samplers.

As a baseline, we first evaluate a vanilla `BaggingClassifier` without any resampling:

```
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate

bagging = BaggingClassifier()
cv_results = cross_validate(bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
```
```0.713 +/- 0.015
```

## Exactly Balanced Bagging and Over-Bagging

The `BalancedBaggingClassifier` can be used in conjunction with a `RandomUnderSampler` or a `RandomOverSampler`. These methods are referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and were first proposed in [1].

```from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler

# Exactly Balanced Bagging
ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
```
```0.760 +/- 0.015
```
```from imblearn.over_sampling import RandomOverSampler

# Over-bagging
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
```
```0.699 +/- 0.012
```

## SMOTE-Bagging

Instead of using a `RandomOverSampler`, which bootstraps samples, an alternative is to use `SMOTE` as the over-sampler, creating synthetic minority samples rather than exact duplicates. This approach is known as SMOTE-Bagging [2].

```from imblearn.over_sampling import SMOTE

# SMOTE-Bagging
smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
```
```0.746 +/- 0.012
```

## Roughly Balanced Bagging

While a `RandomUnderSampler` or a `RandomOverSampler` creates exactly the desired number of samples, it does not follow the statistical spirit of the bagging framework. The authors in [3] propose to draw the number of majority-class samples for each bag from a negative binomial distribution, and then perform a random under-sampling.
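To see why a negative binomial distribution with `p=0.5` is a sensible choice here, note that its expected number of failures for `n` successes is `n * (1 - p) / p = n`: each bag is balanced on average, while its exact size fluctuates from bag to bag. The following numerical check (an illustration added here, not part of the original example) confirms this:

```
# Check the mean of negative binomial draws against the minority count
import numpy as np

rng = np.random.default_rng(0)
n_minority = 1_000
draws = rng.negative_binomial(n=n_minority, p=0.5, size=100_000)
# With p=0.5, the expectation n * (1 - p) / p equals n_minority
print(f"mean draw: {draws.mean():.1f} (expected {n_minority})")
```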

Here, we illustrate this method by implementing a function in charge of the resampling and wrapping it in a `FunctionSampler` so that it can be used within the `BalancedBaggingClassifier` and `cross_validate`.

```
from collections import Counter

import numpy as np

from imblearn import FunctionSampler


def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)

    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])

    return X[indices], y[indices]


# Roughly Balanced Bagging
rbb = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
```
```0.760 +/- 0.018
```
