imblearn.over_sampling.SMOTENC

class imblearn.over_sampling.SMOTENC(categorical_features, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=1)[source]

Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC).

Unlike SMOTE, SMOTE-NC is designed for datasets containing both continuous and categorical features.

Read more in the User Guide.

Parameters:
categorical_features : ndarray, shape (n_cat_features,) or (n_features,)

Specifies which features are categorical. Can be either:

  • array of indices specifying the categorical features;
  • mask array of shape (n_features, ) and bool dtype for which True indicates the categorical features.
sampling_strategy : float, str, dict or callable, optional (default='auto')

Sampling information to resample the data set. When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (only available for binary classification). When str, it specifies which classes are targeted by the resampling; 'auto' is equivalent to 'not majority'. When dict, the keys correspond to the targeted classes and the values to the desired number of samples for each class. When callable, it takes y and returns such a dict.

random_state : int, RandomState instance or None, optional (default=None)

Control the randomization of the algorithm. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
k_neighbors : int or object, optional (default=5)

If int, number of nearest neighbours used to construct synthetic samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.

n_jobs : int, optional (default=1)

The number of threads to open if possible.
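A hedged sketch of how the parameters above can be supplied (the 20-feature layout and the choice of columns 18 and 19 are made up for illustration): categorical_features accepts either integer indices or a boolean mask, and k_neighbors accepts either an int or a pre-configured nearest-neighbours estimator.

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> from imblearn.over_sampling import SMOTENC
>>> # categorical features given as column indices
>>> sm_idx = SMOTENC(categorical_features=[18, 19], random_state=0)
>>> # the same information given as a boolean mask of shape (n_features,)
>>> mask = np.zeros(20, dtype=bool)
>>> mask[[18, 19]] = True
>>> sm_mask = SMOTENC(categorical_features=mask, random_state=0)
>>> # k_neighbors given as an estimator instead of an int
>>> nn = NearestNeighbors(n_neighbors=6)
>>> sm_nn = SMOTENC(categorical_features=[18, 19], k_neighbors=nn, random_state=0)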

See also

SMOTE
Over-sample using SMOTE.
SVMSMOTE
Over-sample using SVM-SMOTE variant.
BorderlineSMOTE
Over-sample using Borderline-SMOTE variant.
ADASYN
Over-sample using ADASYN.

Notes

See the original paper [1] for more details.

Supports multi-class resampling. A one-vs.-rest scheme is used as originally proposed in [1]. An illustrative multi-class sketch follows below.

See Comparison of the different over-sampling algorithms and the plot_smote.py example.
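A minimal multi-class sketch (the three-class toy dataset below is generated purely for illustration): with the default sampling_strategy='auto', each minority class is over-sampled in turn until it matches the majority class.

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTENC
>>> X, y = make_classification(n_classes=3, weights=[0.1, 0.3, 0.6],
...                            n_informative=4, n_features=10,
...                            n_clusters_per_class=1, n_samples=1000,
...                            random_state=0)
>>> X[:, -1] = (X[:, -1] > 0).astype(int)  # treat the last column as categorical
>>> sm = SMOTENC(categorical_features=[9], random_state=0)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> # all three classes now contain as many samples as the original majority class
>>> counts = Counter(y_res)
>>> len(set(counts.values())) == 1
True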

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 321-357, 2002.

Examples

>>> from collections import Counter
>>> from numpy.random import RandomState
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTENC
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape (%s, %s)' % X.shape)
Original dataset shape (1000, 20)
>>> print('Original dataset samples per class {}'.format(Counter(y)))
Original dataset samples per class Counter({1: 900, 0: 100})
>>> # simulate the last two columns to be categorical features
>>> X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))
>>> sm = SMOTENC(random_state=42, categorical_features=[18, 19])
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset samples per class {}'.format(Counter(y_res)))
Resampled dataset samples per class Counter({0: 900, 1: 900})
__init__(categorical_features, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=1)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y)[source]

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Data array.

y : array-like, shape (n_samples,)

Target array.

Returns:
self : object

Return the instance itself.

fit_resample(X, y)[source]

Resample the dataset.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like, shape (n_samples_new,)

The corresponding label of X_resampled.
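A hedged sketch of fit_resample with a dict sampling_strategy (all numbers below are illustrative): the returned arrays keep the same number of features and stay aligned along the sample axis.

>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTENC
>>> X, y = make_classification(n_classes=2, weights=[0.2, 0.8],
...                            n_informative=3, n_features=5,
...                            n_samples=500, random_state=0)
>>> X[:, 0] = (X[:, 0] > 0).astype(int)  # treat the first column as categorical
>>> # request 300 samples for class 0 instead of fully balancing the classes
>>> sm = SMOTENC(categorical_features=[0], sampling_strategy={0: 300},
...              random_state=0)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> X_res.shape[1] == X.shape[1]  # the number of features is unchanged
True
>>> X_res.shape[0] == y_res.shape[0]  # samples and labels stay aligned
True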

fit_sample(X, y)[source]

Resample the dataset.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like, shape (n_samples_new,)

The corresponding label of X_resampled.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
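A hedged illustration of both forms, assuming SMOTENC sits inside an imblearn Pipeline under the arbitrary step name 'smotenc':

>>> from imblearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from imblearn.over_sampling import SMOTENC
>>> # plain set_params on the estimator itself
>>> sm = SMOTENC(categorical_features=[0]).set_params(k_neighbors=10)
>>> # nested form: <step name>__<parameter> updates a component of the pipeline
>>> pipe = Pipeline([('smotenc', SMOTENC(categorical_features=[0])),
...                  ('clf', LogisticRegression())])
>>> pipe = pipe.set_params(smotenc__k_neighbors=3)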
