imblearn.under_sampling.OneSidedSelection

class imblearn.under_sampling.OneSidedSelection(sampling_strategy='auto', random_state=None, n_neighbors=None, n_seeds_S=1, n_jobs=None)

Class to perform under-sampling based on one-sided selection method.

Read more in the User Guide.

Parameters
sampling_strategy : str, list or callable, default='auto'

Sampling information to sample the data set.

  • When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:

    'majority': resample only the majority class;

    'not minority': resample all classes but the minority class;

    'not majority': resample all classes but the majority class;

    'all': resample all classes;

    'auto': equivalent to 'not minority'.

  • When list, the list contains the classes targeted by the resampling.

  • When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

random_state : int, RandomState instance or None, default=None

Control the randomization of the algorithm.

  • If int, random_state is the seed used by the random number generator;

  • If RandomState instance, random_state is the random number generator;

  • If None, the random number generator is the RandomState instance used by np.random.

n_neighbors : int or object, default=None

If int, the number of neighbors to use for the nearest-neighbors search. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin and that will be used to find the nearest neighbors.

n_seeds_S : int, default=1

Number of samples to extract in order to build the set S.

n_jobs : int, default=None

Number of CPU cores used for the nearest-neighbors computations. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

See also

EditedNearestNeighbours

Undersample by editing noisy samples.

Notes

The method is based on [R518cdb13373a-1].

Supports multi-class resampling. A one-vs.-one scheme is used when sampling a class as proposed in [R518cdb13373a-1]. For each class to be sampled, all samples of this class and the minority class are used during the sampling procedure.
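For intuition, the two-class procedure can be sketched with NumPy and scikit-learn. This is a simplified reading of the cited paper, not the library's implementation: the function name one_sided_selection_sketch and its arguments are hypothetical, a single seed sample is drawn, and duplicate points are assumed absent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors


def one_sided_selection_sketch(X, y, minority, majority, seed=0):
    """Return indices kept by a simplified two-class one-sided selection."""
    rng = np.random.RandomState(seed)
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y == majority)

    # Step 1: C starts as every minority sample plus one random majority seed.
    seed_idx = rng.choice(idx_maj, size=1)
    c_idx = np.concatenate([idx_min, seed_idx])

    # Step 2: a 1-NN rule fitted on C classifies the majority samples;
    # the misclassified ones are added to C.
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[c_idx], y[c_idx])
    misclassified = idx_maj[knn.predict(X[idx_maj]) != majority]
    keep = np.unique(np.concatenate([c_idx, misclassified]))

    # Step 3: drop majority samples that take part in a Tomek link, i.e. a
    # pair of mutual nearest neighbors carrying different labels.
    nn = NearestNeighbors(n_neighbors=1).fit(X[keep])
    nearest = nn.kneighbors(return_distance=False)[:, 0]
    y_keep = y[keep]
    mutual = nearest[nearest] == np.arange(len(keep))
    in_link = mutual & (y_keep != y_keep[nearest])
    drop = in_link & (y_keep == majority)
    return keep[~drop]


X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
kept = one_sided_selection_sketch(X, y, minority=0, majority=1)
n_minority_kept = int(np.sum(y[kept] == 0))
print(len(kept), n_minority_kept)
```

In the multi-class case described above, the same steps would be repeated for each targeted class against the minority class.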

References

[R518cdb13373a-1] M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: one-sided selection,” in ICML, vol. 97, pp. 179-186, 1997.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import OneSidedSelection
>>> X, y = make_classification(n_classes=2, class_sep=2,
...     weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
...     n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> oss = OneSidedSelection(random_state=42)
>>> X_res, y_res = oss.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 496, 0: 100})

Attributes
sample_indices_ : ndarray of shape (n_new_samples,)

Indices of the samples selected.

New in version 0.4.

__init__(self, sampling_strategy='auto', random_state=None, n_neighbors=None, n_seeds_S=1, n_jobs=None)

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y : array-like of shape (n_samples,)

Target array.

Returns
self : object

Return the instance itself.

fit_resample(self, X, y)

Resample the dataset.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

fit_sample(self, X, y)

Resample the dataset.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : object

Estimator instance.
