imblearn.under_sampling.TomekLinks¶

class imblearn.under_sampling.TomekLinks(sampling_strategy='auto', return_indices=False, random_state=None, n_jobs=1, ratio=None)[source]¶

Class to perform under-sampling by removing Tomek’s links.

Read more in the User Guide.

Parameters:
sampling_strategy : str, list or callable

Sampling information to sample the data set (a short usage sketch follows this parameter list).

  • When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:

    'majority': resample only the majority class;

    'not minority': resample all classes but the minority class;

    'not majority': resample all classes but the majority class;

    'all': resample all classes;

    'auto': equivalent to 'not minority'.

  • When list, the list contains the classes targeted by the resampling.

  • When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

return_indices : bool, optional (default=False)

Whether or not to return the indices of the selected samples.

Deprecated since version 0.4: return_indices is deprecated. Use the attribute sample_indices_ instead.

random_state : int, RandomState instance or None, optional (default=None)

Control the randomization of the algorithm.

  • If int, random_state is the seed used by the random number generator;
  • If RandomState instance, random_state is the random number generator;
  • If None, the random number generator is the RandomState instance used by np.random.

Deprecated since version 0.4: random_state is deprecated in 0.4 and will be removed in 0.6.

n_jobs : int, optional (default=1)

The number of threads to open if possible.

ratio : str, dict, or callable

Deprecated since version 0.4: Use the parameter sampling_strategy instead. It will be removed in 0.6.
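
As a short, hedged sketch of how the different sampling_strategy values are passed in practice, the snippet below reuses the toy dataset from the Examples section further down; the particular values shown ('auto', 'majority', and an explicit list of class labels) are illustrative choices, and the resampled counts are not reproduced here.

>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import TomekLinks
>>> X, y = make_classification(n_classes=2, class_sep=2,
...     weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
...     n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> # default: clean all classes except the minority one ('auto' is equivalent to 'not minority')
>>> X_res, y_res = TomekLinks(sampling_strategy='auto').fit_resample(X, y)
>>> # clean only the majority class
>>> X_res, y_res = TomekLinks(sampling_strategy='majority').fit_resample(X, y)
>>> # target the classes to clean explicitly, as a list of labels
>>> X_res, y_res = TomekLinks(sampling_strategy=[1]).fit_resample(X, y)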

Notes

This method is based on [1].

Supports multi-class resampling. A one-vs.-rest scheme is used as originally proposed in [1].

References

[1] I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import TomekLinks
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> tl = TomekLinks()
>>> X_res, y_res = tl.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 897, 0: 100})
Attributes:
sample_indices_ : ndarray, shape (n_new_samples,)

Indices of the samples selected.

New in version 0.4: sample_indices_ used instead of return_indices=True.

__init__(sampling_strategy='auto', return_indices=False, random_state=None, n_jobs=1, ratio=None)[source]¶

Initialize self. See help(type(self)) for accurate signature.

fit(X, y)[source]¶

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Data array.

y : array-like, shape (n_samples,)

Target array.

Returns:
self : object

Return the instance itself.

fit_resample(X, y)[source]¶

Resample the dataset.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like, shape (n_samples_new,)

The corresponding label of X_resampled.

fit_sample(X, y)[source]¶

Resample the dataset.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:
X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like, shape (n_samples_new,)

The corresponding label of X_resampled.

get_params(deep=True)[source]¶

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

static is_tomek(y, nn_index, class_type)[source]¶

is_tomek uses the target vector and the index of the first (nearest) neighbour of every sample point to look for Tomek pairs, returning a boolean vector with True for the majority samples that are part of a Tomek link. A small hand-checkable sketch follows this method's description.

Parameters:
y : ndarray, shape (n_samples, )

Target vector of the data set, necessary to keep track of whether a sample belongs to the minority class or not.

nn_index : ndarray, shape (len(y), )

The index of the nearest neighbour of each sample point.

class_type : int or str

The label of the minority class.

Returns:
is_tomek : ndarray, shape (len(y), )

Boolean vector of length n_samples, with True for majority samples that are part of a Tomek link.
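
To make the Tomek-pair test concrete, here is a small hand-checkable sketch; the four 1-D points and their class labels are assumptions made up for illustration, and it uses the public fit_resample path together with sample_indices_ rather than calling is_tomek directly. Samples 0 and 1 are mutual nearest neighbours from different classes, so they form a Tomek link, and the majority member (sample 1) is removed.

>>> import numpy as np
>>> from collections import Counter
>>> from imblearn.under_sampling import TomekLinks
>>> X = np.array([[0.0], [0.1], [1.0], [2.0]])  # samples 0 and 1 are each other's nearest neighbour
>>> y = np.array([0, 1, 1, 1])                  # class 0 is the minority class
>>> tl = TomekLinks()
>>> X_res, y_res = tl.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 2, 0: 1})
>>> tl.sample_indices_  # sample 1, the majority half of the Tomek link, has been dropped
array([0, 2, 3])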

set_params(**params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
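
As an illustrative sketch of the nested form, the snippet below places TomekLinks inside an imblearn pipeline and updates its sampling_strategy through set_params; the step name tomeklinks is the one make_pipeline derives from the class name, and LogisticRegression is only a placeholder classifier for the sketch.

>>> from imblearn.pipeline import make_pipeline
>>> from imblearn.under_sampling import TomekLinks
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = make_pipeline(TomekLinks(), LogisticRegression())
>>> _ = pipe.set_params(tomeklinks__sampling_strategy='not minority')
>>> pipe.get_params()['tomeklinks__sampling_strategy']
'not minority'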

Examples using imblearn.under_sampling.TomekLinks¶

  • Usage of the sampling_strategy parameter for the different algorithms
  • Illustration of the definition of a Tomek link
