6. Miscellaneous samplers
6.1. Custom samplers
A fully customized sampler, FunctionSampler, is available in imbalanced-learn so that you can quickly prototype your own sampler by defining a single function. Additional parameters can be passed to that function through the kw_args parameter, which accepts a dictionary. The following example illustrates how to retain the first 10 elements of the arrays X and y:
>>> import numpy as np
>>> from imblearn import FunctionSampler
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> def func(X, y):
...     return X[:10], y[:10]
>>> sampler = FunctionSampler(func=func)
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> np.all(X_res == X[:10])
True
>>> np.all(y_res == y[:10])
True
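As a minimal sketch of kw_args (the extra n_samples argument below is only a name chosen for this illustration), the dictionary entries are forwarded to the function as keyword arguments:
>>> def func(X, y, n_samples=10):
...     # keep the first `n_samples` rows; `n_samples` is filled in from `kw_args`
...     return X[:n_samples], y[:n_samples]
>>> sampler = FunctionSampler(func=func, kw_args={"n_samples": 25})
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> X_res.shape[0]
25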
In addition, the parameter validate controls input checking. For instance, setting validate=False allows passing any type of target y, which makes it possible to resample regression targets:
>>> from sklearn.datasets import make_regression
>>> X_reg, y_reg = make_regression(n_samples=100, random_state=42)
>>> rng = np.random.RandomState(42)
>>> def dummy_sampler(X, y):
...     indices = rng.choice(np.arange(X.shape[0]), size=10)
...     return X[indices], y[indices]
>>> sampler = FunctionSampler(func=dummy_sampler, validate=False)
>>> X_res, y_res = sampler.fit_resample(X_reg, y_reg)
>>> y_res
array([ 41.49112498, -142.78526195, 85.55095317, 141.43321419,
        75.46571114, -67.49177372, 159.72700509, -169.80498923,
        211.95889757, 211.95889757])
We illustrate the use of such a sampler to implement an outlier rejection estimator that can be easily used within a Pipeline:
Customized sampler to implement an outlier rejections estimator
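As a rough sketch of the idea (the detector and its settings below are arbitrary illustrative choices, not necessarily those of the linked example), an outlier detector such as IsolationForest can be wrapped in a FunctionSampler so that samples flagged as outliers are removed from the training set; since samplers only act during fit, prediction data is left untouched:
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.linear_model import LogisticRegression
>>> from imblearn.pipeline import Pipeline
>>> def outlier_rejection(X, y):
...     # keep only the samples labelled as inliers (+1) by IsolationForest
...     inliers = IsolationForest(random_state=0).fit_predict(X) == 1
...     return X[inliers], y[inliers]
>>> model = Pipeline([
...     ("outlier_rejection", FunctionSampler(func=outlier_rejection)),
...     ("classifier", LogisticRegression()),
... ])
>>> _ = model.fit(X, y)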
6.2. Custom generators
Imbalanced-learn provides specific generators for TensorFlow and Keras that generate balanced mini-batches.
6.2.1. TensorFlow generator
The balanced_batch_generator function generates balanced mini-batches using an imbalanced-learn sampler that returns indices. Let's first generate some data:
>>> n_features, n_classes = 10, 2
>>> X, y = make_classification(
...     n_samples=10_000, n_features=n_features, n_informative=2,
...     n_redundant=0, n_repeated=0, n_classes=n_classes,
...     n_clusters_per_class=1, weights=[0.1, 0.9],
...     class_sep=0.8, random_state=0
... )
>>> X = X.astype(np.float32)
Then, we can create the generator that will yield balanced mini-batches:
>>> from imblearn.under_sampling import RandomUnderSampler
>>> from imblearn.tensorflow import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
...     X,
...     y,
...     sample_weight=None,
...     sampler=RandomUnderSampler(),
...     batch_size=32,
...     random_state=42,
... )
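To make the interface concrete, a mini-batch can be drawn by hand (this simply advances the generator by one step; only the shapes are shown here):
>>> X_batch, y_batch = next(training_generator)
>>> X_batch.shape, y_batch.shape
((32, 10), (32,))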
The generator and steps_per_epoch are used during the training of a TensorFlow model. We will illustrate how to use this generator. First, we define a logistic regression model which will be optimized by gradient descent:
>>> import tensorflow as tf
>>> # initialize the weights and intercept
>>> normal_initializer = tf.random_normal_initializer(mean=0, stddev=0.01)
>>> coef = tf.Variable(normal_initializer(
...     shape=[n_features, n_classes]), dtype="float32"
... )
>>> intercept = tf.Variable(
...     normal_initializer(shape=[n_classes]), dtype="float32"
... )
>>> # define the model
>>> def logistic_regression(X):
...     return tf.nn.softmax(tf.matmul(X, coef) + intercept)
>>> # define the loss function
>>> def cross_entropy(y_true, y_pred):
...     y_true = tf.one_hot(y_true, depth=n_classes)
...     y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
...     return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))
>>> # define our metric
>>> def balanced_accuracy(y_true, y_pred):
...     cm = tf.math.confusion_matrix(tf.cast(y_true, tf.int64), tf.argmax(y_pred, 1))
...     per_class = np.diag(cm) / tf.math.reduce_sum(cm, axis=1)
...     return np.mean(per_class)
>>> # define the optimizer
>>> optimizer = tf.optimizers.SGD(learning_rate=0.01)
>>> # define the optimization step
>>> def run_optimization(X, y):
...     with tf.GradientTape() as g:
...         y_pred = logistic_regression(X)
...         loss = cross_entropy(y, y_pred)
...     gradients = g.gradient(loss, [coef, intercept])
...     optimizer.apply_gradients(zip(gradients, [coef, intercept]))
Once initialized, the model is trained by iterating over balanced mini-batches of data and minimizing the loss previously defined:
>>> epochs = 10
>>> for e in range(epochs):
...     y_pred = logistic_regression(X)
...     loss = cross_entropy(y, y_pred)
...     bal_acc = balanced_accuracy(y, y_pred)
...     print(f"epoch: {e}, loss: {loss:.3f}, accuracy: {bal_acc}")
...     for i in range(steps_per_epoch):
...         X_batch, y_batch = next(training_generator)
...         run_optimization(X_batch, y_batch)
epoch: 0, ...
6.2.2. Keras generator
Keras provides a higher-level API in which a model can be defined and trained by calling the fit method. To illustrate, we will define a logistic regression model:
>>> from tensorflow import keras
>>> y = keras.utils.to_categorical(y, n_classes)
>>> model = keras.Sequential()
>>> model.add(
...     keras.layers.Dense(
...         y.shape[1], input_dim=X.shape[1], activation='softmax'
...     )
... )
>>> model.compile(
...     optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy']
... )
balanced_batch_generator creates a generator of balanced mini-batches, together with the number of mini-batches that will be generated per epoch:
>>> from imblearn.keras import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )
Then, fit can be called, passing the generator and the number of steps per epoch:
>>> callback_history = model.fit(
...     training_generator,
...     steps_per_epoch=steps_per_epoch,
...     epochs=10,
...     verbose=1,
... )
Epoch 1/10 ...
The second possibility is to use BalancedBatchGenerator. Only an instance of this class needs to be passed to fit; since this generator knows its own length, steps_per_epoch is not required:
>>> from imblearn.keras import BalancedBatchGenerator
>>> training_generator = BalancedBatchGenerator(
...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )
>>> callback_history = model.fit(
...     training_generator,
...     epochs=10,
...     verbose=1,
... )
Epoch 1/10 ...