.. _datasets: ========================= Dataset loading utilities ========================= .. currentmodule:: imblearn.datasets The :mod:`imblearn.datasets` package is complementing the :mod:`sklearn.datasets` package. The package provides both: (i) a set of imbalanced datasets to perform systematic benchmark and (ii) a utility to create an imbalanced dataset from an original balanced dataset. .. _zenodo: Imbalanced datasets for benchmark ================================= :func:`fetch_datasets` allows to fetch 27 datasets which are imbalanced and binarized. The following data sets are available: +--+--------------+-------------------------------+-------+---------+-----+ |ID|Name | Repository & Target | Ratio | #S | #F | +==+==============+===============================+=======+=========+=====+ |1 |ecoli | UCI, target: imU | 8.6:1 | 336 | 7 | +--+--------------+-------------------------------+-------+---------+-----+ |2 |optical_digits| UCI, target: 8 | 9.1:1 | 5,620 | 64 | +--+--------------+-------------------------------+-------+---------+-----+ |3 |satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 | +--+--------------+-------------------------------+-------+---------+-----+ |4 |pen_digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 | +--+--------------+-------------------------------+-------+---------+-----+ |5 |abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 | +--+--------------+-------------------------------+-------+---------+-----+ |6 |sick_euthyroid| UCI, target: sick euthyroid | 9.8:1 | 3,163 | 42 | +--+--------------+-------------------------------+-------+---------+-----+ |7 |spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 | +--+--------------+-------------------------------+-------+---------+-----+ |8 |car_eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 21 | +--+--------------+-------------------------------+-------+---------+-----+ |9 |isolet | UCI, target: A, B | 12:1 | 7,797 | 617 | +--+--------------+-------------------------------+-------+---------+-----+ |10|us_crime | UCI, target: >0.65 | 12:1 | 1,994 | 100 | +--+--------------+-------------------------------+-------+---------+-----+ |11|yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 | +--+--------------+-------------------------------+-------+---------+-----+ |12|scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 | +--+--------------+-------------------------------+-------+---------+-----+ |13|libras_move | UCI, target: 1 | 14:1 | 360 | 90 | +--+--------------+-------------------------------+-------+---------+-----+ |14|thyroid_sick | UCI, target: sick | 15:1 | 3,772 | 52 | +--+--------------+-------------------------------+-------+---------+-----+ |15|coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 | +--+--------------+-------------------------------+-------+---------+-----+ |16|arrhythmia | UCI, target: 06 | 17:1 | 452 | 278 | +--+--------------+-------------------------------+-------+---------+-----+ |17|solar_flare_m0| UCI, target: M->0 | 19:1 | 1,389 | 32 | +--+--------------+-------------------------------+-------+---------+-----+ |18|oil | UCI, target: minority | 22:1 | 937 | 49 | +--+--------------+-------------------------------+-------+---------+-----+ |19|car_eval_4 | UCI, target: vgood | 26:1 | 1,728 | 21 | +--+--------------+-------------------------------+-------+---------+-----+ |20|wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 | +--+--------------+-------------------------------+-------+---------+-----+ |21|letter_img | UCI, target: Z | 26:1 | 20,000 | 16 | +--+--------------+-------------------------------+-------+---------+-----+ |22|yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 | +--+--------------+-------------------------------+-------+---------+-----+ |23|webpage | LIBSVM, w7a, target: minority | 33:1 | 34,780 | 300 | +--+--------------+-------------------------------+-------+---------+-----+ |24|ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 | +--+--------------+-------------------------------+-------+---------+-----+ |25|mammography | UCI, target: minority | 42:1 | 11,183 | 6 | +--+--------------+-------------------------------+-------+---------+-----+ |26|protein_homo | KDD CUP 2004, minority | 11:1 | 145,751 | 74 | +--+--------------+-------------------------------+-------+---------+-----+ |27|abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 10 | +--+--------------+-------------------------------+-------+---------+-----+ A specific data set can be selected as:: >>> from collections import Counter >>> from imblearn.datasets import fetch_datasets >>> ecoli = fetch_datasets()['ecoli'] >>> ecoli.data.shape (336, 7) >>> print(sorted(Counter(ecoli.target).items())) [(-1, 301), (1, 35)] .. _make_imbalanced: Imbalanced generator ==================== :func:`make_imbalance` turns an original dataset into an imbalanced dataset. This behaviour is driven by the parameter ``sampling_strategy`` which behave similarly to other resampling algorithm. ``sampling_strategy`` can be given as a dictionary where the key corresponds to the class and the value is the number of samples in the class:: >>> from sklearn.datasets import load_iris >>> from imblearn.datasets import make_imbalance >>> iris = load_iris() >>> sampling_strategy = {0: 20, 1: 30, 2: 40} >>> X_imb, y_imb = make_imbalance(iris.data, iris.target, ... sampling_strategy=sampling_strategy) >>> sorted(Counter(y_imb).items()) [(0, 20), (1, 30), (2, 40)] Note that all samples of a class are passed-through if the class is not mentioned in the dictionary:: >>> sampling_strategy = {0: 10} >>> X_imb, y_imb = make_imbalance(iris.data, iris.target, ... sampling_strategy=sampling_strategy) >>> sorted(Counter(y_imb).items()) [(0, 10), (1, 50), (2, 50)] Instead of a dictionary, a function can be defined and directly pass to ``sampling_strategy``:: >>> def ratio_multiplier(y): ... multiplier = {0: 0.5, 1: 0.7, 2: 0.95} ... target_stats = Counter(y) ... for key, value in target_stats.items(): ... target_stats[key] = int(value * multiplier[key]) ... return target_stats >>> X_imb, y_imb = make_imbalance(iris.data, iris.target, ... sampling_strategy=ratio_multiplier) >>> sorted(Counter(y_imb).items()) [(0, 25), (1, 35), (2, 47)] It would also work with pandas dataframe:: >>> from sklearn.datasets import fetch_openml >>> df, y = fetch_openml( ... 'iris', version=1, return_X_y=True, as_frame=True) >>> df_resampled, y_resampled = make_imbalance( ... df, y, sampling_strategy={'Iris-setosa': 10, 'Iris-versicolor': 20}, ... random_state=42) >>> df_resampled.head() sepallength sepalwidth petallength petalwidth 13 4.3 3.0 1.1 0.1 39 5.1 3.4 1.5 0.2 30 4.8 3.1 1.6 0.2 45 4.8 3.0 1.4 0.3 17 5.1 3.5 1.4 0.3 >>> Counter(y_resampled) Counter({'Iris-virginica': 50, 'Iris-versicolor': 20, 'Iris-setosa': 10}) See :ref:`sphx_glr_auto_examples_datasets_plot_make_imbalance.py` and :ref:`sphx_glr_auto_examples_api_plot_sampling_strategy_usage.py`.