.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/under-sampling/plot_comparison_under_sampling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_under-sampling_plot_comparison_under_sampling.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py:

===============================
Compare under-sampling samplers
===============================

The following example makes a qualitative comparison between the different
under-sampling algorithms available in the imbalanced-learn package.

.. GENERATED FROM PYTHON SOURCE LINES 9-13

.. code-block:: Python

    # Authors: Guillaume Lemaitre
    # License: MIT

.. GENERATED FROM PYTHON SOURCE LINES 14-20

.. code-block:: Python

    print(__doc__)

    import seaborn as sns

    sns.set_context("poster")

.. GENERATED FROM PYTHON SOURCE LINES 21-24

The following function will be used to create a toy dataset. It uses
:func:`~sklearn.datasets.make_classification` from scikit-learn with some
fixed parameters.

.. GENERATED FROM PYTHON SOURCE LINES 27-51

.. code-block:: Python

    from sklearn.datasets import make_classification


    def create_dataset(
        n_samples=1000,
        weights=(0.01, 0.01, 0.98),
        n_classes=3,
        class_sep=0.8,
        n_clusters=1,
    ):
        return make_classification(
            n_samples=n_samples,
            n_features=2,
            n_informative=2,
            n_redundant=0,
            n_repeated=0,
            n_classes=n_classes,
            n_clusters_per_class=n_clusters,
            weights=list(weights),
            class_sep=class_sep,
            random_state=0,
        )

.. GENERATED FROM PYTHON SOURCE LINES 52-54

The following function will be used to plot the sample space after resampling
to illustrate the specificities of an algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 57-66
.. code-block:: Python

    def plot_resampling(X, y, sampler, ax, title=None):
        X_res, y_res = sampler.fit_resample(X, y)
        ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor="k")
        if title is None:
            title = f"Resampling with {sampler.__class__.__name__}"
        ax.set_title(title)
        sns.despine(ax=ax, offset=10)

.. GENERATED FROM PYTHON SOURCE LINES 67-69

The following function will be used to plot the decision function of a
classifier given some data.

.. GENERATED FROM PYTHON SOURCE LINES 72-91

.. code-block:: Python

    import numpy as np


    def plot_decision_function(X, y, clf, ax, title=None):
        plot_step = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(
            np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
        )
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, alpha=0.4)
        ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor="k")
        if title is not None:
            ax.set_title(title)

.. GENERATED FROM PYTHON SOURCE LINES 92-97

.. code-block:: Python

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()

.. GENERATED FROM PYTHON SOURCE LINES 98-103

Prototype generation: under-sampling by generating new samples
--------------------------------------------------------------

:class:`~imblearn.under_sampling.ClusterCentroids` under-samples by replacing
the original samples with the centroids of the clusters found.

.. GENERATED FROM PYTHON SOURCE LINES 105-131
.. code-block:: Python

    import matplotlib.pyplot as plt
    from sklearn.cluster import MiniBatchKMeans

    from imblearn import FunctionSampler
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import ClusterCentroids

    X, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)

    samplers = [
        FunctionSampler(),  # identity resampler
        ClusterCentroids(
            estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=0
        ),
    ]

    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X, y, model, ax[0], title=f"Decision function with {sampler.__class__.__name__}"
        )
        plot_resampling(X, y, sampler, ax[1])
    fig.tight_layout()

.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_001.png
   :alt: Decision function with ClusterCentroids, Resampling with ClusterCentroids, Decision function with FunctionSampler, Resampling with FunctionSampler
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 132-144

Prototype selection: under-sampling by selecting existing samples
-----------------------------------------------------------------

Algorithms performing prototype selection can be subdivided into two groups:
(i) controlled under-sampling methods and (ii) cleaning under-sampling
methods.

With the controlled under-sampling methods, the number of samples to be
selected can be specified.
:class:`~imblearn.under_sampling.RandomUnderSampler` is the most naive way of
performing such a selection, by randomly picking a given number of samples
from the targeted class.

.. GENERATED FROM PYTHON SOURCE LINES 146-165
.. code-block:: Python

    from imblearn.under_sampling import RandomUnderSampler

    X, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)

    samplers = [
        FunctionSampler(),  # identity resampler
        RandomUnderSampler(random_state=0),
    ]

    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X, y, model, ax[0], title=f"Decision function with {sampler.__class__.__name__}"
        )
        plot_resampling(X, y, sampler, ax[1])
    fig.tight_layout()

.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_002.png
   :alt: Decision function with FunctionSampler, Resampling with FunctionSampler, Decision function with RandomUnderSampler, Resampling with RandomUnderSampler
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 166-176

The :class:`~imblearn.under_sampling.NearMiss` algorithms implement some
heuristic rules to select samples. NearMiss-1 selects samples from the
majority class for which the average distance to the :math:`k` nearest
samples of the minority class is the smallest. NearMiss-2 selects samples
from the majority class for which the average distance to the farthest
samples of the minority class is the smallest. NearMiss-3 is a two-step
algorithm: first, for each minority sample, its :math:`m` nearest neighbors
are kept; then, the majority samples selected are the ones for which the
average distance to the :math:`k` nearest neighbors is the largest.

.. GENERATED FROM PYTHON SOURCE LINES 178-203
.. code-block:: Python

    from imblearn.under_sampling import NearMiss

    X, y = create_dataset(n_samples=1000, weights=(0.05, 0.15, 0.8), class_sep=1.5)

    samplers = [NearMiss(version=1), NearMiss(version=2), NearMiss(version=3)]

    fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))
    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X,
            y,
            model,
            ax[0],
            title=f"Decision function for {sampler.__class__.__name__}-{sampler.version}",
        )
        plot_resampling(
            X,
            y,
            sampler,
            ax[1],
            title=f"Resampling using {sampler.__class__.__name__}-{sampler.version}",
        )
    fig.tight_layout()

.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
   :alt: Decision function for NearMiss-1, Resampling using NearMiss-1, Decision function for NearMiss-2, Resampling using NearMiss-2, Decision function for NearMiss-3, Resampling using NearMiss-3
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/project/imblearn/under_sampling/_prototype_selection/_nearmiss.py:203: UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned.
      warnings.warn(
    /home/circleci/project/imblearn/under_sampling/_prototype_selection/_nearmiss.py:203: UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned.
      warnings.warn(
    /home/circleci/project/imblearn/under_sampling/_prototype_selection/_nearmiss.py:203: UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned.
      warnings.warn(
    /home/circleci/project/imblearn/under_sampling/_prototype_selection/_nearmiss.py:203: UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned.
      warnings.warn(

.. GENERATED FROM PYTHON SOURCE LINES 204-212

:class:`~imblearn.under_sampling.EditedNearestNeighbours` removes samples of
the majority class whose class differs from that of their nearest neighbors.
This sieve can be applied repeatedly, which is the principle of
:class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours`.
:class:`~imblearn.under_sampling.AllKNN` differs slightly from
:class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours` by changing
the :math:`k` parameter of the internal nearest-neighbors algorithm,
increasing it at each iteration.

.. GENERATED FROM PYTHON SOURCE LINES 214-240

.. code-block:: Python

    from imblearn.under_sampling import (
        AllKNN,
        EditedNearestNeighbours,
        RepeatedEditedNearestNeighbours,
    )

    X, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)

    samplers = [
        EditedNearestNeighbours(),
        RepeatedEditedNearestNeighbours(),
        AllKNN(allow_minority=True),
    ]

    fig, axs = plt.subplots(3, 2, figsize=(15, 25))
    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X, y, model, ax[0], title=f"Decision function for \n{sampler.__class__.__name__}"
        )
        plot_resampling(
            X, y, sampler, ax[1], title=f"Resampling using \n{sampler.__class__.__name__}"
        )
    fig.tight_layout()
.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
   :alt: Decision function for EditedNearestNeighbours, Resampling using EditedNearestNeighbours, Decision function for RepeatedEditedNearestNeighbours, Resampling using RepeatedEditedNearestNeighbours, Decision function for AllKNN, Resampling using AllKNN
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 241-252

:class:`~imblearn.under_sampling.CondensedNearestNeighbour` makes use of a
1-NN rule to iteratively decide whether a sample should be kept in the
dataset. One issue is that
:class:`~imblearn.under_sampling.CondensedNearestNeighbour` is sensitive to
noise, since it preserves noisy samples.
:class:`~imblearn.under_sampling.OneSidedSelection` also uses 1-NN but
applies :class:`~imblearn.under_sampling.TomekLinks` to remove the samples
considered noisy. :class:`~imblearn.under_sampling.NeighbourhoodCleaningRule`
uses :class:`~imblearn.under_sampling.EditedNearestNeighbours` to remove some
samples. Additionally, it uses a 3-nearest-neighbors rule to remove samples
that do not agree with this rule.

.. GENERATED FROM PYTHON SOURCE LINES 254-280

.. code-block:: Python

    from imblearn.under_sampling import (
        CondensedNearestNeighbour,
        NeighbourhoodCleaningRule,
        OneSidedSelection,
    )

    X, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)

    fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))
    samplers = [
        CondensedNearestNeighbour(random_state=0),
        OneSidedSelection(random_state=0),
        NeighbourhoodCleaningRule(n_neighbors=11),
    ]

    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X, y, model, ax[0], title=f"Decision function for \n{sampler.__class__.__name__}"
        )
        plot_resampling(
            X, y, sampler, ax[1], title=f"Resampling using \n{sampler.__class__.__name__}"
        )
    fig.tight_layout()
.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_005.png
   :alt: Decision function for CondensedNearestNeighbour, Resampling using CondensedNearestNeighbour, Decision function for OneSidedSelection, Resampling using OneSidedSelection, Decision function for NeighbourhoodCleaningRule, Resampling using NeighbourhoodCleaningRule
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 281-284

:class:`~imblearn.under_sampling.InstanceHardnessThreshold` uses the
predictions of a classifier to exclude samples: all samples that are
classified with a low probability are removed.

.. GENERATED FROM PYTHON SOURCE LINES 286-312

.. code-block:: Python

    from imblearn.under_sampling import InstanceHardnessThreshold

    samplers = [
        FunctionSampler(),  # identity resampler
        InstanceHardnessThreshold(
            estimator=LogisticRegression(),
            random_state=0,
        ),
    ]

    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
    for ax, sampler in zip(axs, samplers):
        model = make_pipeline(sampler, clf).fit(X, y)
        plot_decision_function(
            X,
            y,
            model,
            ax[0],
            title=f"Decision function with \n{sampler.__class__.__name__}",
        )
        plot_resampling(
            X, y, sampler, ax[1], title=f"Resampling using \n{sampler.__class__.__name__}"
        )
    fig.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png
   :alt: Decision function with InstanceHardnessThreshold, Resampling using InstanceHardnessThreshold, Decision function with FunctionSampler, Resampling using FunctionSampler
   :srcset: /auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 9.581 seconds)

**Estimated memory usage:** 47 MB

.. _sphx_glr_download_auto_examples_under-sampling_plot_comparison_under_sampling.py:

.. only:: html
  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_comparison_under_sampling.ipynb <plot_comparison_under_sampling.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_comparison_under_sampling.py <plot_comparison_under_sampling.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_