.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_over_sampling_benchmark_lfw.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_over_sampling_benchmark_lfw.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_over_sampling_benchmark_lfw.py:


==========================================================
Benchmark over-sampling methods in a face recognition task
==========================================================

In this face recognition example, two faces are used from the LFW (Labeled
Faces in the Wild) dataset. Several of the implemented over-sampling methods
are used in conjunction with a 3-nearest-neighbors (3NN) classifier in order
to examine how much an over-sampler improves the quality of the classifier's
predictions.

.. GENERATED FROM PYTHON SOURCE LINES 12-17

.. code-block:: Python

    # Authors: Christos Aridas
    #          Guillaume Lemaitre
    # License: MIT

.. GENERATED FROM PYTHON SOURCE LINES 18-24

.. code-block:: Python

    print(__doc__)

    import seaborn as sns

    sns.set_context("poster")

.. GENERATED FROM PYTHON SOURCE LINES 25-31

Load the dataset
----------------

We will use a dataset containing images of known persons and build a model to
recognize the person shown in each image. We turn this into a binary
classification problem by keeping only the pictures of George W. Bush and
Bill Clinton.

.. GENERATED FROM PYTHON SOURCE LINES 33-42

.. code-block:: Python

    import numpy as np

    from sklearn.datasets import fetch_lfw_people

    data = fetch_lfw_people()
    george_bush_id = 1871  # Photos of George W. Bush
    bill_clinton_id = 531  # Photos of Bill Clinton
    classes = [george_bush_id, bill_clinton_id]
    classes_name = np.array(["B. Clinton", "G.W. Bush"], dtype=object)

.. GENERATED FROM PYTHON SOURCE LINES 43-48

.. code-block:: Python

    mask_photos = np.isin(data.target, classes)
    X, y = data.data[mask_photos], data.target[mask_photos]
    y = (y == george_bush_id).astype(np.int8)
    y = classes_name[y]

.. GENERATED FROM PYTHON SOURCE LINES 49-50

We can check the ratio between the two classes.

.. GENERATED FROM PYTHON SOURCE LINES 52-62

.. code-block:: Python

    import matplotlib.pyplot as plt
    import pandas as pd

    class_distribution = pd.Series(y).value_counts(normalize=True)
    ax = class_distribution.plot.barh()
    ax.set_title("Class distribution")
    pos_label = class_distribution.idxmin()
    plt.tight_layout()
    print(f"The positive label considered as the minority class is {pos_label}")

.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_over_sampling_benchmark_lfw_001.png
   :alt: Class distribution
   :srcset: /auto_examples/applications/images/sphx_glr_plot_over_sampling_benchmark_lfw_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    The positive label considered as the minority class is B. Clinton

.. GENERATED FROM PYTHON SOURCE LINES 63-74

We see that we have an imbalanced classification problem with ~95% of the data
belonging to the class G.W. Bush.

Compare over-sampling approaches
--------------------------------

We will use different over-sampling approaches together with a kNN classifier
to check whether we can recognize the two presidents. The evaluation will be
performed through cross-validation, and we will plot the mean ROC curve.

We will create different pipelines and evaluate them.

.. GENERATED FROM PYTHON SOURCE LINES 74-90

.. code-block:: Python

    from sklearn.neighbors import KNeighborsClassifier

    from imblearn import FunctionSampler
    from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
    from imblearn.pipeline import make_pipeline

    classifier = KNeighborsClassifier(n_neighbors=3)

    pipeline = [
        make_pipeline(FunctionSampler(), classifier),
        make_pipeline(RandomOverSampler(random_state=42), classifier),
        make_pipeline(ADASYN(random_state=42), classifier),
        make_pipeline(SMOTE(random_state=42), classifier),
    ]
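Before evaluating the pipelines, it can help to see what an over-sampler
actually does to the training data. The short sketch below is not part of the
generated example: it applies ``SMOTE`` on its own, outside of any pipeline,
and compares the class counts before and after resampling. It reuses the
``X`` and ``y`` arrays and the ``SMOTE`` import defined above; the variable
names ``sampler``, ``X_resampled`` and ``y_resampled`` are only illustrative.

.. code-block:: Python

    # Illustrative sketch only: inspect the effect of one over-sampler on the
    # class counts. Inside the pipelines above, the same resampling is applied
    # automatically to each training fold during cross-validation.
    from collections import Counter

    sampler = SMOTE(random_state=42)
    X_resampled, y_resampled = sampler.fit_resample(X, y)
    print(f"Class counts before resampling: {dict(Counter(y))}")
    print(f"Class counts after resampling:  {dict(Counter(y_resampled))}")

Note that the ``FunctionSampler()`` entry in the list of pipelines acts as a
"no resampling" baseline: with no function given, it returns the data
unchanged.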
.. GENERATED FROM PYTHON SOURCE LINES 91-95

.. code-block:: Python

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=3)

.. GENERATED FROM PYTHON SOURCE LINES 96-99

We will compute the mean ROC curve for each pipeline using the different
splits provided by the :class:`~sklearn.model_selection.StratifiedKFold`
cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 101-133

.. code-block:: Python

    from sklearn.metrics import RocCurveDisplay, auc, roc_curve

    disp = []

    for model in pipeline:
        # compute the mean fpr/tpr to get the mean ROC curve
        mean_tpr, mean_fpr = 0.0, np.linspace(0, 1, 100)
        for train, test in cv.split(X, y):
            model.fit(X[train], y[train])
            y_proba = model.predict_proba(X[test])

            pos_label_idx = np.flatnonzero(model.classes_ == pos_label)[0]
            fpr, tpr, thresholds = roc_curve(
                y[test], y_proba[:, pos_label_idx], pos_label=pos_label
            )
            mean_tpr += np.interp(mean_fpr, fpr, tpr)
            mean_tpr[0] = 0.0
        mean_tpr /= cv.get_n_splits(X, y)
        mean_tpr[-1] = 1.0
        mean_auc = auc(mean_fpr, mean_tpr)

        # Create a display that we will reuse to make the aggregated plots for
        # all methods
        disp.append(
            RocCurveDisplay(
                fpr=mean_fpr,
                tpr=mean_tpr,
                roc_auc=mean_auc,
                estimator_name=f"{model[0].__class__.__name__}",
            )
        )

.. GENERATED FROM PYTHON SOURCE LINES 134-136

In the previous cell, we created the different mean ROC curves, which we can
now plot on the same figure.

.. GENERATED FROM PYTHON SOURCE LINES 138-151

.. code-block:: Python

    fig, ax = plt.subplots(figsize=(9, 9))
    for d in disp:
        d.plot(ax=ax, linestyle="--")

    ax.plot([0, 1], [0, 1], linestyle="--", color="k")
    ax.axis("square")
    fig.suptitle("Comparison of over-sampling methods \nwith a 3NN classifier")
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    sns.despine(offset=10, ax=ax)
    plt.legend(loc="lower right", fontsize=16)
    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_over_sampling_benchmark_lfw_002.png
   :alt: Comparison of over-sampling methods with a 3NN classifier
   :srcset: /auto_examples/applications/images/sphx_glr_plot_over_sampling_benchmark_lfw_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 152-155

We see that for this task, methods that generate new samples by interpolation
(i.e. ADASYN and SMOTE) perform better than random over-sampling or no
resampling.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 35.132 seconds)

**Estimated memory usage:** 299 MB


.. _sphx_glr_download_auto_examples_applications_plot_over_sampling_benchmark_lfw.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_over_sampling_benchmark_lfw.ipynb <plot_over_sampling_benchmark_lfw.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_over_sampling_benchmark_lfw.py <plot_over_sampling_benchmark_lfw.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_