5. Ensemble of samplers
5.1. Classifier including inner balancing samplers
5.1.1. Bagging classifier
In ensemble classifiers, bagging methods build several estimators on different
randomly selected subsets of data. In scikit-learn, this classifier is named
BaggingClassifier. However, this classifier does not allow each subset of data
to be balanced. Therefore, when training on an imbalanced data set, this
classifier will favor the majority classes:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94], class_sep=0.8,
... random_state=0)
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(DecisionTreeClassifier(), random_state=0)
>>> bc.fit(X_train, y_train)
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.77...
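The effect can be made concrete by inspecting the per-class recall. The sketch
below uses scikit-learn's recall_score with average=None (the exact values
depend on the library versions used); the minority classes typically show a
much lower recall than the majority class:
>>> from sklearn.metrics import recall_score
>>> recall_score(y_test, y_pred, average=None)  # one recall value per class
array([...])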
In BalancedBaggingClassifier, each bootstrap sample will be further resampled
to achieve the desired sampling_strategy. Therefore, BalancedBaggingClassifier
takes the same parameters as the scikit-learn BaggingClassifier. In addition,
the resampling is controlled by the parameter sampler, or by the two
parameters sampling_strategy and replacement if one wants to use the
RandomUnderSampler:
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(DecisionTreeClassifier(),
... sampling_strategy='auto',
... replacement=False,
... random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...
Changing the sampler will give rise to different known implementations
[MO97], [HKT09], [WY09]. You can refer to the following example, which shows
these different methods in practice:
Bagging classifiers using sampler
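For instance, a minimal sketch swapping the default random under-sampler for a
random over-sampler via the sampler parameter (assuming an imbalanced-learn
version in which BalancedBaggingClassifier exposes this parameter):
>>> from imblearn.over_sampling import RandomOverSampler
>>> bbc_over = BalancedBaggingClassifier(DecisionTreeClassifier(),
...                                      sampler=RandomOverSampler(random_state=0),
...                                      random_state=0)
>>> bbc_over.fit(X_train, y_train)  # each bootstrap sample is over-sampled
BalancedBaggingClassifier(...)
>>> y_pred = bbc_over.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...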
5.1.2. Forest of randomized trees
BalancedRandomForestClassifier is another ensemble method in which each tree
of the forest will be provided a balanced bootstrap sample [CLB+04]. This
class provides all the functionality of the RandomForestClassifier:
>>> from imblearn.ensemble import BalancedRandomForestClassifier
>>> brf = BalancedRandomForestClassifier(
... n_estimators=100, random_state=0, sampling_strategy="all", replacement=True,
... bootstrap=False,
... )
>>> brf.fit(X_train, y_train)
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...
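Since the class behaves like a RandomForestClassifier, the usual fitted
attributes are available as well; for example, a quick sketch inspecting the
impurity-based feature importances (the exact values depend on the versions
used):
>>> brf.feature_importances_  # one importance per input feature
array([...])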
5.1.3. Boosting
Several methods taking advantage of boosting have been designed.
RUSBoostClassifier randomly under-samples the dataset before performing each
boosting iteration [SKVHN09]:
>>> from imblearn.ensemble import RUSBoostClassifier
>>> rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
... random_state=0)
>>> rusboost.fit(X_train, y_train)
RUSBoostClassifier(...)
>>> y_pred = rusboost.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...
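RUSBoostClassifier follows the AdaBoost API; assuming the staged_predict
method is inherited from the scikit-learn boosting machinery, a sketch
monitoring the balanced accuracy across boosting iterations could look like:
>>> stage_scores = [balanced_accuracy_score(y_test, y_pred_stage)
...                 for y_pred_stage in rusboost.staged_predict(X_test)]
>>> len(stage_scores) <= 200  # at most one stage per boosting iteration
True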
A specific method which uses AdaBoostClassifier as learners in the bagging
classifier is called "EasyEnsemble". The EasyEnsembleClassifier allows bagging
AdaBoost learners which are trained on balanced bootstrap samples [LWZ08].
Similarly to the BalancedBaggingClassifier API, one can construct the ensemble
as:
>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train)
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.6...
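The size of the ensemble can be tuned through n_estimators, the number of
AdaBoost learners in the bag; a minimal sketch (the value below is chosen for
illustration only):
>>> eec25 = EasyEnsembleClassifier(n_estimators=25, random_state=0)
>>> eec25.fit(X_train, y_train)  # 25 AdaBoost learners on balanced bootstraps
EasyEnsembleClassifier(...)
>>> y_pred = eec25.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...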