5. Ensemble of samplers
5.1. Classifier including inner balancing samplers
5.1.1. Bagging classifier
In ensemble classifiers, bagging methods build several estimators on different
randomly selected subsets of data. In scikit-learn, this classifier is named
BaggingClassifier. However, this classifier does not
allow balancing each subset of data. Therefore, when trained on an imbalanced
dataset, this classifier will favor the majority classes:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94], class_sep=0.8,
... random_state=0)
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(DecisionTreeClassifier(), random_state=0)
>>> bc.fit(X_train, y_train)
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.77...
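The aggregated score hides how each class is treated. To make the bias toward
the majority class explicit, one can for instance inspect the per-class recall
(a small sketch added here for illustration, not part of the original example);
the two minority classes typically obtain a much lower recall than the majority
class:
>>> from sklearn.metrics import recall_score
>>> per_class_recall = recall_score(y_test, y_pred, average=None)  # one recall value per class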
In BalancedBaggingClassifier, each bootstrap sample is further
resampled to achieve the desired sampling_strategy. BalancedBaggingClassifier
therefore takes the same parameters as the scikit-learn BaggingClassifier. In
addition, the resampling is controlled by the parameter sampler or, if one
wants to use the RandomUnderSampler, by the two parameters
sampling_strategy and replacement:
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(DecisionTreeClassifier(),
... sampling_strategy='auto',
... replacement=False,
... random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...
Changing the sampler gives rise to different methods known from the literature
[MO97], [HKT09], [WY09]; a sketch with a different sampler is shown below. You
can also refer to the following example, which shows these different methods in
practice:
Bagging classifiers using sampler
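For instance, a sketch of passing an over-sampler such as SMOTE through the
sampler parameter (instead of relying on the default random under-sampling)
could look like this; the resulting score is not reproduced here:
>>> from imblearn.over_sampling import SMOTE
>>> bbc_smote = BalancedBaggingClassifier(DecisionTreeClassifier(),
...                                       sampler=SMOTE(random_state=0),
...                                       random_state=0)
>>> bbc_smote.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred_smote = bbc_smote.predict(X_test)
>>> smote_bagging_score = balanced_accuracy_score(y_test, y_pred_smote)  # value not shown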
5.1.2. Forest of randomized trees
BalancedRandomForestClassifier is another ensemble method in which
each tree of the forest is given a balanced bootstrap sample
[CLB+04]. This class provides all the functionality of the
RandomForestClassifier:
>>> from imblearn.ensemble import BalancedRandomForestClassifier
>>> brf = BalancedRandomForestClassifier(
... n_estimators=100, random_state=0, sampling_strategy="all", replacement=True,
... bootstrap=False,
... )
>>> brf.fit(X_train, y_train)
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...
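Since BalancedRandomForestClassifier provides the functionality of
RandomForestClassifier, the fitted forest can be inspected in the same way; for
instance, a short illustrative sketch of the impurity-based feature
importances:
>>> importances = brf.feature_importances_  # one importance value per feature
>>> importances.shape
(2,)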
5.1.3. Boosting
Several methods taking advantage of boosting have been designed.
RUSBoostClassifier randomly under-samples the dataset before each
boosting iteration [SKVHN09]:
>>> from imblearn.ensemble import RUSBoostClassifier
>>> rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
... random_state=0)
>>> rusboost.fit(X_train, y_train)
RUSBoostClassifier(...)
>>> y_pred = rusboost.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...
A specific method that uses AdaBoostClassifier as
learners in the bagging classifier is called “EasyEnsemble”. The
EasyEnsembleClassifier allows bagging AdaBoost learners that are
trained on balanced bootstrap samples [LWZ08]. Similarly to
the BalancedBaggingClassifier API, one can construct the ensemble as:
>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train)
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.6...
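The number of AdaBoost learners that are bagged is controlled by n_estimators;
a minimal sketch with a larger ensemble (the score is not reproduced here):
>>> eec_large = EasyEnsembleClassifier(n_estimators=20, random_state=0)
>>> eec_large.fit(X_train, y_train)
EasyEnsembleClassifier(...)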