Usage of pipeline embedding samplers

An example of the :class:`~imblearn.pipeline.Pipeline` object (or the make_pipeline helper function) working with transformers and resamplers.

# Authors: Christos Aridas
#          Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT

print(__doc__)

Let’s first create an imbalanced dataset and split it into a training set and a test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_classes=2,
    class_sep=1.25,
    weights=[0.3, 0.7],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=5,
    n_clusters_per_class=1,
    n_samples=5000,
    random_state=10,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
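As a quick sanity check, the class proportions can be compared before and after the split. The sketch below rebuilds the same dataset and uses a small helper, `class_fractions`, which is a hypothetical name introduced here for illustration only. Because the split is stratified, the training and test sets should keep roughly the same imbalance as the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same dataset and split as in the example above.
X, y = make_classification(
    n_classes=2,
    class_sep=1.25,
    weights=[0.3, 0.7],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=5,
    n_clusters_per_class=1,
    n_samples=5000,
    random_state=10,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)


def class_fractions(labels):
    """Return the share of samples belonging to each class (illustrative helper)."""
    counts = np.bincount(labels)
    return counts / counts.sum()


full_frac = class_fractions(y)
train_frac = class_fractions(y_train)
test_frac = class_fractions(y_test)

print("full :", full_frac)
print("train:", train_frac)
print("test :", test_frac)
```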

Now, we will create each of the individual steps that we would like to combine later.

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

pca = PCA(n_components=2)
enn = EditedNearestNeighbours()
smote = SMOTE(random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)

Now, we can finally create a pipeline to specify the order in which the different transformers and samplers should be executed before the data are passed to the final classifier.

from imblearn.pipeline import make_pipeline

model = make_pipeline(pca, enn, smote, knn)

We can now use the resulting pipeline as a normal classifier: resampling happens when calling fit and is disabled when calling decision_function, predict_proba, or predict.

from sklearn.metrics import classification_report

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       375
           1       1.00      1.00      1.00       875

    accuracy                           0.99      1250
   macro avg       0.99      0.99      0.99      1250
weighted avg       0.99      0.99      0.99      1250
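The difference between fit and predict described above can be illustrated with a toy pipeline written in plain Python. This is a deliberately simplified sketch and not imblearn's actual implementation: every class below (`ToyPipeline`, `Doubler`, `DuplicateMinority`, `ThresholdClassifier`) is hypothetical and exists only to show that samplers change the data seen by the final estimator during fit and are skipped during predict.

```python
from collections import Counter


class Doubler:
    """Toy transformer: multiplies every feature by two."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


class DuplicateMinority:
    """Toy sampler for binary labels: repeats minority rows until classes match."""

    def fit_resample(self, X, y):
        counts = Counter(y)
        minority = min(counts, key=counts.get)
        majority = max(counts, key=counts.get)
        extra = counts[majority] - counts[minority]
        minority_rows = [row for row, label in zip(X, y) if label == minority]
        X_new = list(X) + [minority_rows[i % len(minority_rows)] for i in range(extra)]
        y_new = list(y) + [minority] * extra
        return X_new, y_new


class ThresholdClassifier:
    """Toy classifier: records how many rows it was fitted on."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        return self

    def predict(self, X):
        return [int(row[0] > 0) for row in X]


class ToyPipeline:
    """Simplified stand-in for a pipeline that accepts samplers."""

    def __init__(self, steps):
        self.intermediate = steps[:-1]
        self.final = steps[-1]

    def fit(self, X, y):
        for step in self.intermediate:
            if hasattr(step, "fit_resample"):
                # Samplers may change the number of rows, but only during fit.
                X, y = step.fit_resample(X, y)
            else:
                X = step.fit(X, y).transform(X)
        self.final.fit(X, y)
        return self

    def predict(self, X):
        for step in self.intermediate:
            if hasattr(step, "fit_resample"):
                continue  # samplers are skipped at prediction time
            X = step.transform(X)
        return self.final.predict(X)


X_toy = [[-1.0], [-2.0], [1.0], [2.0], [3.0], [4.0]]
y_toy = [0, 0, 1, 1, 1, 1]

pipe = ToyPipeline([Doubler(), DuplicateMinority(), ThresholdClassifier()])
pipe.fit(X_toy, y_toy)
predictions = pipe.predict([[-5.0], [5.0]])

print(pipe.final.n_seen_)  # rows seen by the classifier after resampling
print(predictions)  # one prediction per input row, no resampling
```

In this sketch the classifier is fitted on more rows than were passed to fit, because the sampler duplicated minority rows, while predict returns exactly one label per input row.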

Total running time of the script: (0 minutes 1.229 seconds)

Estimated memory usage: 205 MB
