.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/api/plot_sampling_strategy_usage.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_api_plot_sampling_strategy_usage.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_api_plot_sampling_strategy_usage.py:


====================================================
How to use ``sampling_strategy`` in imbalanced-learn
====================================================

This example shows the different usages of the parameter ``sampling_strategy``
for the different families of samplers (i.e. over-sampling, under-sampling, or
cleaning methods).

.. GENERATED FROM PYTHON SOURCE LINES 11-15

.. code-block:: Python


    # Authors: Guillaume Lemaitre
    # License: MIT


.. GENERATED FROM PYTHON SOURCE LINES 16-21

.. code-block:: Python

    print(__doc__)

    import seaborn as sns

    sns.set_context("poster")


.. GENERATED FROM PYTHON SOURCE LINES 22-26

Create an imbalanced dataset
----------------------------

First, we will create an imbalanced data set from the iris data set.

.. GENERATED FROM PYTHON SOURCE LINES 28-37

.. code-block:: Python

    from sklearn.datasets import load_iris

    from imblearn.datasets import make_imbalance

    iris = load_iris(as_frame=True)

    # keep 10, 20 and 47 samples of classes 0, 1 and 2, respectively
    sampling_strategy = {0: 10, 1: 20, 2: 47}
    X, y = make_imbalance(iris.data, iris.target, sampling_strategy=sampling_strategy)


.. GENERATED FROM PYTHON SOURCE LINES 38-48

.. code-block:: Python

    import matplotlib.pyplot as plt

    fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
    autopct = "%.2f"
    iris.target.value_counts().plot.pie(autopct=autopct, ax=axs[0])
    axs[0].set_title("Original")
    y.value_counts().plot.pie(autopct=autopct, ax=axs[1])
    axs[1].set_title("Imbalanced")
    fig.tight_layout()


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_001.png
   :alt: Original, Imbalanced
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 49-60

Using ``sampling_strategy`` in resampling algorithms
====================================================

`sampling_strategy` as a `float`
--------------------------------

`sampling_strategy` can be given as a `float`. For **under-sampling methods**,
it corresponds to the ratio :math:`\alpha_{us}` defined by
:math:`\alpha_{us} = N_{m} / N_{rM}` where :math:`N_{rM}` and :math:`N_{m}` are
the number of samples in the majority class after resampling and the number of
samples in the minority class, respectively.

.. GENERATED FROM PYTHON SOURCE LINES 62-68

.. code-block:: Python


    # select only 2 classes since the ratio only makes sense in the binary case
    binary_mask = y.isin([0, 1])
    binary_y = y[binary_mask]
    binary_X = X[binary_mask]


.. GENERATED FROM PYTHON SOURCE LINES 69-77

.. code-block:: Python

    from imblearn.under_sampling import RandomUnderSampler

    sampling_strategy = 0.8
    rus = RandomUnderSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = rus.fit_resample(binary_X, binary_y)
    ax = y_res.value_counts().plot.pie(autopct=autopct)
    _ = ax.set_title("Under-sampling")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_002.png
   :alt: Under-sampling
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 78-83

For **over-sampling methods**, it corresponds to the ratio
:math:`\alpha_{os}` defined by :math:`N_{rm} = \alpha_{os} \times N_{M}`
where :math:`N_{rm}` and :math:`N_{M}` are the number of samples in the
minority class after resampling and the number of samples in the majority
class, respectively.

.. GENERATED FROM PYTHON SOURCE LINES 85-92

.. code-block:: Python

    from imblearn.over_sampling import RandomOverSampler

    ros = RandomOverSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = ros.fit_resample(binary_X, binary_y)
    ax = y_res.value_counts().plot.pie(autopct=autopct)
    _ = ax.set_title("Over-sampling")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_003.png
   :alt: Over-sampling
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 93-101

`sampling_strategy` as a `str`
-------------------------------

`sampling_strategy` can be given as a string which specifies the classes
targeted by the resampling. With under- and over-sampling, the number of
samples in the targeted classes will be equalized.

Note that we are using multiple classes from now on.

.. GENERATED FROM PYTHON SOURCE LINES 103-117

.. code-block:: Python

    sampling_strategy = "not minority"

    fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
    rus = RandomUnderSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = rus.fit_resample(X, y)
    y_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])
    axs[0].set_title("Under-sampling")

    sampling_strategy = "not majority"
    ros = RandomOverSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = ros.fit_resample(X, y)
    y_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])
    _ = axs[1].set_title("Over-sampling")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_004.png
   :alt: Under-sampling, Over-sampling
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 118-120

With **cleaning methods**, the number of samples in each class will not be
equalized even if targeted.

.. GENERATED FROM PYTHON SOURCE LINES 122-130

.. code-block:: Python

    from imblearn.under_sampling import TomekLinks

    sampling_strategy = "not minority"
    tl = TomekLinks(sampling_strategy=sampling_strategy)
    X_res, y_res = tl.fit_resample(X, y)
    ax = y_res.value_counts().plot.pie(autopct=autopct)
    _ = ax.set_title("Cleaning")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_005.png
   :alt: Cleaning
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_005.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 131-138

`sampling_strategy` as a `dict`
-------------------------------

When `sampling_strategy` is a `dict`, the keys correspond to the targeted
classes. The values correspond to the desired number of samples for each
targeted class. This works for both **under- and over-sampling** algorithms
but not for **cleaning algorithms**, which take a `list` instead.

.. GENERATED FROM PYTHON SOURCE LINES 140-154

.. code-block:: Python

    fig, axs = plt.subplots(ncols=2, figsize=(10, 5))

    sampling_strategy = {0: 10, 1: 15, 2: 20}
    rus = RandomUnderSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = rus.fit_resample(X, y)
    y_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])
    axs[0].set_title("Under-sampling")

    sampling_strategy = {0: 25, 1: 35, 2: 47}
    ros = RandomOverSampler(sampling_strategy=sampling_strategy)
    X_res, y_res = ros.fit_resample(X, y)
    y_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])
    _ = axs[1].set_title("Over-sampling")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_006.png
   :alt: Under-sampling, Over-sampling
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 155-161

`sampling_strategy` as a `list`
-------------------------------

When `sampling_strategy` is a `list`, the list contains the targeted classes.
It is used only for **cleaning methods** and raises an error otherwise.

.. GENERATED FROM PYTHON SOURCE LINES 163-169

.. code-block:: Python

    sampling_strategy = [0, 1, 2]
    tl = TomekLinks(sampling_strategy=sampling_strategy)
    X_res, y_res = tl.fit_resample(X, y)
    ax = y_res.value_counts().plot.pie(autopct=autopct)
    _ = ax.set_title("Cleaning")


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_007.png
   :alt: Cleaning
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_007.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 170-176

`sampling_strategy` as a callable
---------------------------------

When `sampling_strategy` is a callable, it is a function taking `y` and
returning a `dict`. The keys correspond to the targeted classes. The values
correspond to the desired number of samples for each class.

.. GENERATED FROM PYTHON SOURCE LINES 179-194

.. code-block:: Python

    def ratio_multiplier(y):
        from collections import Counter

        # fraction of the current samples to keep for the targeted classes
        multiplier = {1: 0.7, 2: 0.95}
        target_stats = Counter(y)
        for key, value in target_stats.items():
            if key in multiplier:
                target_stats[key] = int(value * multiplier[key])
        return target_stats


    X_res, y_res = RandomUnderSampler(sampling_strategy=ratio_multiplier).fit_resample(X, y)
    ax = y_res.value_counts().plot.pie(autopct=autopct)
    ax.set_title("Under-sampling")
    plt.show()


.. image-sg:: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_008.png
   :alt: Under-sampling
   :srcset: /auto_examples/api/images/sphx_glr_plot_sampling_strategy_usage_008.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 2.370 seconds)

**Estimated memory usage:** 36 MB


.. _sphx_glr_download_auto_examples_api_plot_sampling_strategy_usage.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_sampling_strategy_usage.ipynb <plot_sampling_strategy_usage.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_sampling_strategy_usage.py <plot_sampling_strategy_usage.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
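
As a supplementary check on the `float` case, the class counts implied by a
given ratio can be worked out by hand. The sketch below is not part of
imbalanced-learn: ``float_strategy_targets`` is a hypothetical helper, and it
assumes the binary setting together with the conventions
:math:`\alpha_{us} = N_{m} / N_{rM}` for under-sampling and
:math:`N_{rm} = \alpha_{os} \times N_{M}` for over-sampling, truncating the
result to an integer.

.. code-block:: Python

    def float_strategy_targets(counts, alpha, kind):
        """Return the per-class counts implied by a float ratio (binary case).

        Hypothetical helper for illustration only, not a library function.
        """
        majority = max(counts, key=counts.get)
        minority = min(counts, key=counts.get)
        targets = dict(counts)
        if kind == "under":
            # alpha_us = N_m / N_rM, hence N_rM = N_m / alpha_us
            targets[majority] = int(counts[minority] / alpha)
        elif kind == "over":
            # N_rm = alpha_os * N_M
            targets[minority] = int(alpha * counts[majority])
        else:
            raise ValueError("kind must be 'under' or 'over'")
        return targets


    # class counts of the binary subset used in this example
    counts = {0: 10, 1: 20}
    under = float_strategy_targets(counts, 0.8, "under")
    over = float_strategy_targets(counts, 0.8, "over")
    print(under, over)

With a ratio of 0.8, the majority class is reduced to 12 samples when
under-sampling, and the minority class is increased to 16 samples when
over-sampling.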