9. Dataset loading utilities

The imblearn.datasets package complements the sklearn.datasets package. It provides (i) a set of imbalanced datasets for systematic benchmarking and (ii) a utility to create an imbalanced dataset from an originally balanced one.

9.1. Imbalanced datasets for benchmark

fetch_datasets fetches 27 datasets that are imbalanced and binarized. The following datasets are available:

==  ==============  ==============================  =====  =======  ===
ID  Name            Repository & Target             Ratio  #S       #F
==  ==============  ==============================  =====  =======  ===
1   ecoli           UCI, target: imU                8.6:1  336      7
2   optical_digits  UCI, target: 8                  9.1:1  5,620    64
3   satimage        UCI, target: 4                  9.3:1  6,435    36
4   pen_digits      UCI, target: 5                  9.4:1  10,992   16
5   abalone         UCI, target: 7                  9.7:1  4,177    10
6   sick_euthyroid  UCI, target: sick euthyroid     9.8:1  3,163    42
7   spectrometer    UCI, target: >=44               11:1   531      93
8   car_eval_34     UCI, target: good, v good       12:1   1,728    21
9   isolet          UCI, target: A, B               12:1   7,797    617
10  us_crime        UCI, target: >0.65              12:1   1,994    100
11  yeast_ml8       LIBSVM, target: 8               13:1   2,417    103
12  scene           LIBSVM, target: >one label      13:1   2,407    294
13  libras_move     UCI, target: 1                  14:1   360      90
14  thyroid_sick    UCI, target: sick               15:1   3,772    52
15  coil_2000       KDD, CoIL, target: minority     16:1   9,822    85
16  arrhythmia      UCI, target: 06                 17:1   452      278
17  solar_flare_m0  UCI, target: M->0               19:1   1,389    32
18  oil             UCI, target: minority           22:1   937      49
19  car_eval_4      UCI, target: vgood              26:1   1,728    21
20  wine_quality    UCI, wine, target: <=4          26:1   4,898    11
21  letter_img      UCI, target: Z                  26:1   20,000   16
22  yeast_me2       UCI, target: ME2                28:1   1,484    8
23  webpage         LIBSVM, w7a, target: minority   33:1   34,780   300
24  ozone_level     UCI, ozone, data                34:1   2,536    72
25  mammography     UCI, target: minority           42:1   11,183   6
26  protein_homo    KDD CUP 2004, minority          11:1   145,751  74
27  abalone_19      UCI, target: 19                 130:1  4,177    10
==  ==============  ==============================  =====  =======  ===

#S is the number of samples and #F is the number of features.

A specific dataset can be selected by its name:

>>> from collections import Counter
>>> from imblearn.datasets import fetch_datasets
>>> ecoli = fetch_datasets()['ecoli']
>>> ecoli.data.shape
(336, 7)
>>> print(sorted(Counter(ecoli.target).items()))
[(-1, 301), (1, 35)]
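
The Ratio column of the table can be reproduced from the class counts printed above. The following is a minimal sketch using only the standard library; imbalance_ratio is a hypothetical helper written for illustration and is not part of imblearn. It divides the size of the largest class by the size of the smallest one.

```python
from collections import Counter


def imbalance_ratio(y):
    """Return majority-class size divided by minority-class size.

    Hypothetical helper for illustration; not an imblearn function.
    """
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


# Same class counts as the ecoli target shown above: 301 negatives, 35 positives.
y = [-1] * 301 + [1] * 35
print(round(imbalance_ratio(y), 1))  # 8.6, i.e. the 8.6:1 listed for ecoli
```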

9.2. Imbalanced generator

make_imbalance turns an original dataset into an imbalanced dataset. This behaviour is driven by the parameter sampling_strategy, which behaves similarly to the same parameter in the other resampling algorithms. sampling_strategy can be given as a dictionary where each key is a class label and each value is the number of samples to keep in that class:

>>> from collections import Counter
>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> sampling_strategy = {0: 20, 1: 30, 2: 40}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
...                               sampling_strategy=sampling_strategy)
>>> sorted(Counter(y_imb).items())
[(0, 20), (1, 30), (2, 40)]

Note that all samples of a class are passed through if the class is not mentioned in the dictionary:

>>> sampling_strategy = {0: 10}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
...                               sampling_strategy=sampling_strategy)
>>> sorted(Counter(y_imb).items())
[(0, 10), (1, 50), (2, 50)]
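
The dictionary semantics described above (listed classes are reduced to the requested count, unlisted classes are passed through) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: undersample_indices is a hypothetical helper, and it simply draws samples without replacement within each class.

```python
import random
from collections import Counter


def undersample_indices(y, sampling_strategy, seed=0):
    """Pick indices so each listed class keeps the requested number of samples.

    Hypothetical helper for illustration; not an imblearn function.
    Classes missing from sampling_strategy keep all of their samples.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)

    kept = []
    for label, indices in by_class.items():
        n_target = sampling_strategy.get(label, len(indices))
        if n_target > len(indices):
            raise ValueError(
                f"class {label!r} has only {len(indices)} samples, "
                f"cannot keep {n_target}"
            )
        # Sample without replacement within the class.
        kept.extend(rng.sample(indices, n_target))
    return sorted(kept)


# Labels with the same class sizes as iris: 50 samples per class.
y = [0] * 50 + [1] * 50 + [2] * 50
kept = undersample_indices(y, {0: 10})
print(sorted(Counter(y[i] for i in kept).items()))
# [(0, 10), (1, 50), (2, 50)]
```

The returned indices can then be used to select the matching rows of the feature matrix, so that the features and the labels stay aligned.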

Instead of a dictionary, a function can be defined and passed directly to sampling_strategy:

>>> def ratio_multiplier(y):
...     multiplier = {0: 0.5, 1: 0.7, 2: 0.95}
...     target_stats = Counter(y)
...     for key, value in target_stats.items():
...         target_stats[key] = int(value * multiplier[key])
...     return target_stats
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
...                               sampling_strategy=ratio_multiplier)
>>> sorted(Counter(y_imb).items())
[(0, 25), (1, 35), (2, 47)]
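
The ratio_multiplier function above hard-codes its multipliers and raises a KeyError for any class that is not listed. One possible variation, sketched here with the standard library only, builds such a callable from a mapping and leaves unlisted classes unchanged. make_multiplier_strategy is a hypothetical name, and the sketch assumes a plain dictionary of per-class counts is an acceptable return value, as in the dictionary form of sampling_strategy.

```python
from collections import Counter


def make_multiplier_strategy(multipliers):
    """Build a callable mapping labels y to target per-class sample counts.

    Hypothetical helper for illustration; not an imblearn function.
    Classes missing from multipliers keep their original count.
    """
    def strategy(y):
        counts = Counter(y)
        return {
            label: int(count * multipliers.get(label, 1.0))
            for label, count in counts.items()
        }
    return strategy


strategy = make_multiplier_strategy({0: 0.5, 1: 0.7, 2: 0.95})

# Labels with the same class sizes as iris: 50 samples per class.
y = [0] * 50 + [1] * 50 + [2] * 50
print(sorted(strategy(y).items()))  # [(0, 25), (1, 35), (2, 47)]
```

Because int truncates, 50 * 0.95 = 47.5 becomes 47, which matches the count reported for class 2 above.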

make_imbalance also works with a pandas DataFrame:

>>> from sklearn.datasets import fetch_openml
>>> df, y = fetch_openml(
...     'iris', version=1, return_X_y=True, as_frame=True)
>>> df_resampled, y_resampled = make_imbalance(
...     df, y, sampling_strategy={'Iris-setosa': 10, 'Iris-versicolor': 20},
...     random_state=42)
>>> df_resampled.head()
        sepallength  sepalwidth  petallength  petalwidth
  13          4.3         3.0          1.1         0.1
  39          5.1         3.4          1.5         0.2
  30          4.8         3.1          1.6         0.2
  45          4.8         3.0          1.4         0.3
  17          5.1         3.5          1.4         0.3
>>> Counter(y_resampled)
Counter({'Iris-virginica': 50, 'Iris-versicolor': 20, 'Iris-setosa': 10})

See the examples "Create an imbalanced dataset" and "How to use sampling_strategy in imbalanced-learn".