imblearn.under_sampling.TomekLinks
class imblearn.under_sampling.TomekLinks(sampling_strategy='auto', return_indices=False, random_state=None, n_jobs=1, ratio=None)
Class to perform under-sampling by removing Tomek’s links.
Read more in the User Guide.
Parameters:
- sampling_strategy : str, list or callable
    Sampling information to sample the data set.
    - When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:
        - 'majority': resample only the majority class;
        - 'not minority': resample all classes but the minority class;
        - 'not majority': resample all classes but the majority class;
        - 'all': resample all classes;
        - 'auto': equivalent to 'not minority'.
    - When list, the list contains the classes targeted by the resampling.
    - When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
- return_indices : bool, optional (default=False)
    Whether or not to return the indices of the selected samples.
    Deprecated since version 0.4: return_indices is deprecated. Use the attribute sample_indices_ instead.
- random_state : int, RandomState instance or None, optional (default=None)
    Control the randomization of the algorithm.
    - If int, random_state is the seed used by the random number generator;
    - If RandomState instance, random_state is the random number generator;
    - If None, the random number generator is the RandomState instance used by np.random.
    Deprecated since version 0.4: random_state is deprecated in 0.4 and will be removed in 0.6.
- n_jobs : int, optional (default=1)
    The number of threads to open if possible.
- ratio : str, dict, or callable
    Deprecated since version 0.4: Use the parameter sampling_strategy instead. It will be removed in 0.6.
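The string values of sampling_strategy listed above can be read as a rule for picking which classes are cleaned. The following is a minimal sketch of that rule, not the library's implementation: the helper name targeted_classes is hypothetical, and ties between class counts are ignored for simplicity.

```python
from collections import Counter


def targeted_classes(y, sampling_strategy="auto"):
    """Return the sorted labels targeted by a string strategy (hypothetical helper)."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)  # most frequent label
    minority = min(counts, key=counts.get)  # least frequent label
    if sampling_strategy == "auto":
        # Documented above: 'auto' is equivalent to 'not minority'.
        sampling_strategy = "not minority"
    if sampling_strategy == "majority":
        selected = [majority]
    elif sampling_strategy == "not minority":
        selected = [label for label in counts if label != minority]
    elif sampling_strategy == "not majority":
        selected = [label for label in counts if label != majority]
    elif sampling_strategy == "all":
        selected = list(counts)
    else:
        raise ValueError("unknown strategy: %r" % (sampling_strategy,))
    return sorted(selected)


# Three classes with 6, 3 and 1 samples respectively.
y = [0] * 6 + [1] * 3 + [2] * 1
for strategy in ("majority", "not minority", "not majority", "all", "auto"):
    print(strategy, targeted_classes(y, strategy))
```

With this toy target, 'majority' selects only class 0, while 'auto' selects classes 0 and 1 and leaves the minority class 2 untouched.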
Notes
This method is based on [1].
Supports multi-class resampling. A one-vs.-rest scheme is used as originally proposed in [1].
References
[1] I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.
Examples
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import TomekLinks # doctest: +NORMALIZE_WHITESPACE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> tl = TomekLinks()
>>> X_res, y_res = tl.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 897, 0: 100})
Attributes:
- sample_indices_ : ndarray, shape (n_new_samples,)
    Indices of the samples selected.
    New in version 0.4: sample_indices_ used instead of return_indices=True.
__init__(sampling_strategy='auto', return_indices=False, random_state=None, n_jobs=1, ratio=None)
Initialize self. See help(type(self)) for accurate signature.
fit(X, y)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
Parameters:
- X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Data array.
- y : array-like, shape (n_samples,)
    Target array.
Returns:
- self : object
    Return the instance itself.
fit_resample(X, y)
Resample the dataset.
Parameters:
- X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Matrix containing the data to be sampled.
- y : array-like, shape (n_samples,)
    Corresponding label for each sample in X.
Returns:
- X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)
    The array containing the resampled data.
- y_resampled : array-like, shape (n_samples_new,)
    The corresponding label of X_resampled.
fit_sample(X, y)
Resample the dataset.
Parameters:
- X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Matrix containing the data to be sampled.
- y : array-like, shape (n_samples,)
    Corresponding label for each sample in X.
Returns:
- X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)
    The array containing the resampled data.
- y_resampled : array-like, shape (n_samples_new,)
    The corresponding label of X_resampled.
get_params(deep=True)
Get parameters for this estimator.
Parameters:
- deep : boolean, optional
    If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
- params : mapping of string to any
    Parameter names mapped to their values.
static is_tomek(y, nn_index, class_type)
Detect Tomek pairs from the target vector and the first neighbour of every sample point. Returns a boolean vector with True for majority samples that are Tomek links.
Parameters:
- y : ndarray, shape (n_samples,)
    Target vector of the data set, necessary to keep track of whether a sample belongs to the minority class or not.
- nn_index : ndarray, shape (len(y),)
    The index of the closest neighbour of each sample point.
- class_type : int or str
    The label of the minority class.
Returns:
- is_tomek : ndarray, shape (len(y),)
    Boolean vector of length n_samples, with True for majority samples that are Tomek links.
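The check described above can be sketched in plain Python. This is an illustrative reading of the description, not the library's code: the names nearest_neighbour_index and is_tomek_sketch are hypothetical, the data is a toy 1-D example, and a sample is flagged when it is not of class class_type, its nearest neighbour is of class class_type, and the two are each other's nearest neighbour.

```python
def nearest_neighbour_index(points):
    """For each 1-D point, return the index of the closest other point (brute force)."""
    nn = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nn.append(min(others, key=lambda j: abs(points[j] - p)))
    return nn


def is_tomek_sketch(y, nn_index, class_type):
    """Flag non-minority samples that form a Tomek link with a minority sample."""
    links = [False] * len(y)
    for i, label in enumerate(y):
        if label == class_type:
            continue  # minority samples are always kept
        j = nn_index[i]
        # A link needs a neighbour from the minority class that points back at i.
        if y[j] == class_type and nn_index[j] == i:
            links[i] = True
    return links


points = [0.0, 0.2, 0.5, 1.0, 1.1, 2.0, 2.3]
y = [0, 0, 0, 0, 1, 1, 1]  # class 1 is the minority class

nn_index = nearest_neighbour_index(points)
links = is_tomek_sketch(y, nn_index, class_type=1)
print(nn_index)  # [1, 0, 1, 4, 3, 6, 5]
print(links)     # [False, False, False, True, False, False, False]
```

Only sample 3 is flagged: it and sample 4 are mutual nearest neighbours with different labels, and sample 4 belongs to the minority class, so it is never marked for removal.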
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns:
- self