KMeansSMOTE

class imblearn.over_sampling.KMeansSMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=2, n_jobs=None, kmeans_estimator=None, cluster_balance_threshold='auto', density_exponent='auto')

Apply KMeans clustering before over-sampling with SMOTE.

This is an implementation of the algorithm described in [1].

See also

SMOTE: Over-sample using SMOTE.
SMOTENC: Over-sample using SMOTE for continuous and categorical features.
SMOTEN: Over-sample using the SMOTE variant specifically for categorical features only.
SVMSMOTE: Over-sample using SVM-SMOTE variant.
BorderlineSMOTE: Over-sample using Borderline-SMOTE variant.
ADASYN: Over-sample using ADASYN.

References

[1] Felix Last, Georgios Douzas, Fernando Bacao, “Oversampling for Imbalanced Learning Based on K-Means and SMOTE” https://arxiv.org/abs/1711.00837

Examples

>>> import numpy as np
>>> from imblearn.over_sampling import KMeansSMOTE
>>> from sklearn.cluster import MiniBatchKMeans
>>> from sklearn.datasets import make_blobs
>>> blobs = [100, 800, 100]
>>> X, y = make_blobs(blobs, centers=[(-10, 0), (0, 0), (10, 0)], random_state=0)
>>> # Add a single 0 sample in the middle blob
>>> X = np.concatenate([X, [[0, 0]]])
>>> y = np.append(y, 0)
>>> # Make this a binary classification problem
>>> y = y == 1
>>> sm = KMeansSMOTE(
...     kmeans_estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42
... )
>>> X_res, y_res = sm.fit_resample(X, y)
>>> # Find the number of new samples in the middle blob
>>> n_res_in_middle = ((X_res[:, 0] > -5) & (X_res[:, 0] < 5)).sum()
>>> print("Samples in the middle blob: %s" % n_res_in_middle)
Samples in the middle blob: 801
>>> print("Middle blob unchanged: %s" % (n_res_in_middle == blobs[1] + 1))
Middle blob unchanged: True
>>> print("More 0 samples: %s" % ((y_res == 0).sum() > (y == 0).sum()))
More 0 samples: True

Methods

`fit`(X, y, **params)	Check inputs and statistics of the sampler.
`fit_resample`(X, y, **params)	Resample the dataset.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y, **params)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features): Data array.
y : array-like of shape (n_samples,): Target array.

Returns:

self : object: Return the instance itself.

fit_resample(X, y, **params)

Resample the dataset.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features): Matrix containing the data to be sampled.
y : array-like of shape (n_samples,): Corresponding label for each sample in X.

Returns:

X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features): The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,): The corresponding label of X_resampled.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out : ndarray of str objects: Same as input features.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict: Estimator parameters.

Returns:

self : estimator instance: Estimator instance.

Examples using `imblearn.over_sampling.KMeansSMOTE`

Compare over-sampling samplers