NeighbourhoodCleaningRule
- class imblearn.under_sampling.NeighbourhoodCleaningRule(*, sampling_strategy='auto', edited_nearest_neighbours=None, n_neighbors=3, kind_sel='deprecated', threshold_cleaning=0.5, n_jobs=None)
Undersample based on the neighbourhood cleaning rule.
This class uses ENN and a k-NN to remove noisy samples from the datasets.
Read more in the User Guide.
- Parameters:
- sampling_strategy : str, list or callable
Sampling information to sample the data set.
When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:
'majority': resample only the majority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'not minority'.
When list, the list contains the classes targeted by the resampling.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
- edited_nearest_neighbours : estimator object, default=None
The EditedNearestNeighbours (ENN) object to clean the dataset. If None, a default ENN is created with kind_sel="mode" and n_neighbors=n_neighbors.
- n_neighbors : int or estimator object, default=3
If int, size of the neighbourhood to consider to compute the K-nearest neighbors. If object, an estimator that inherits from KNeighborsMixin that will be used to find the nearest neighbors. By default, it will be a 3-NN.
- kind_sel : {"all", "mode"}, default='all'
Strategy to use in order to exclude samples in the ENN sampling.
If 'all', all neighbours will have to agree with the sample of interest for it not to be excluded.
If 'mode', the majority vote of the neighbours will be used in order to exclude a sample.
The strategy "all" will be less conservative than 'mode'. Thus, more samples will generally be removed when kind_sel="all".
Deprecated since version 0.12: kind_sel is deprecated in 0.12 and will be removed in 0.14. Currently the parameter has no effect and always corresponds to the "all" strategy.
- threshold_cleaning : float, default=0.5
Threshold used to decide whether to consider a class during the cleaning after applying ENN. A class will be considered during cleaning when:
Ci > C x T,
where Ci and C are the number of samples in the class and in the data set, respectively, and T is the threshold.
- n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
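The two kind_sel strategies described above can be illustrated with a small standalone sketch. The helper keep_sample is hypothetical and is not part of imbalanced-learn; it only mirrors the documented voting rules on a list of neighbour labels, and it assumes that ties in the 'mode' vote do not occur.

```python
from collections import Counter


def keep_sample(label, neighbour_labels, kind_sel="all"):
    """Hypothetical helper: return True when the sample would be kept."""
    if kind_sel == "all":
        # Every neighbour must share the label of the sample of interest.
        return all(n == label for n in neighbour_labels)
    if kind_sel == "mode":
        # The most frequent neighbour label must match the sample's label.
        # Assumption: no ties between neighbour labels.
        most_common_label, _ = Counter(neighbour_labels).most_common(1)[0]
        return most_common_label == label
    raise ValueError(f"unknown kind_sel: {kind_sel!r}")


neighbours = [1, 1, 0]  # labels of the 3 nearest neighbours
print(keep_sample(1, neighbours, kind_sel="all"))   # False
print(keep_sample(1, neighbours, kind_sel="mode"))  # True
```

With one disagreeing neighbour the sample is excluded under "all" but kept under 'mode', which is why "all" tends to remove more samples.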
- Attributes:
- sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
- edited_nearest_neighbours_ : estimator object
The edited nearest neighbour object used to make the first resampling.
- nn_ : estimator object
Validated K-nearest Neighbours object created from the n_neighbors parameter.
- classes_to_clean_ : list
The classes considered with under-sampling by nn_ in the second cleaning phase.
- sample_indices_ : ndarray of shape (n_new_samples,)
Indices of the samples selected.
New in version 0.4.
- n_features_in_ : int
Number of features in the input dataset.
New in version 0.9.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 0.10.
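As a rough illustration of the threshold_cleaning rule stated under Parameters (Ci > C x T), the sketch below applies the inequality literally to a label vector. The function classes_above_threshold is a hypothetical helper, not a library call, and the exact quantity used for C inside the library may differ between versions, so treat this as a reading of the documented formula only.

```python
from collections import Counter


def classes_above_threshold(y, threshold_cleaning=0.5):
    """Hypothetical helper: classes satisfying Ci > C x T as documented."""
    counts = Counter(y)
    n_total = sum(counts.values())  # C: number of samples in the data set
    return sorted(
        label
        for label, n_class in counts.items()  # n_class is Ci
        if n_class > n_total * threshold_cleaning
    )


y = [0] * 100 + [1] * 900
print(classes_above_threshold(y, threshold_cleaning=0.5))   # [1]
print(classes_above_threshold(y, threshold_cleaning=0.05))  # [0, 1]
```

Lowering the threshold lets smaller classes pass the test, so more classes become candidates for the second cleaning phase.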
See also
EditedNearestNeighbours
Undersample by editing noisy samples.
Notes
See the original paper: [1].
Supports multi-class resampling. A one-vs.-rest scheme is used when sampling a class as proposed in [1].
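For multi-class data, the string and list forms of sampling_strategy decide which classes are targeted. The sketch below is a simplified reading of the options listed under Parameters; targeted_classes is a hypothetical helper written for illustration, not the library's own resolution code, and it ignores the callable form.

```python
from collections import Counter


def targeted_classes(y, sampling_strategy="auto"):
    """Hypothetical helper: class labels targeted by a str or list strategy."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if isinstance(sampling_strategy, list):
        # A list names the targeted classes directly.
        return sorted(sampling_strategy)
    if sampling_strategy == "majority":
        return [majority]
    if sampling_strategy in ("not minority", "auto"):
        return sorted(c for c in counts if c != minority)
    if sampling_strategy == "not majority":
        return sorted(c for c in counts if c != majority)
    if sampling_strategy == "all":
        return sorted(counts)
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")


y = [0] * 10 + [1] * 50 + [2] * 200
print(targeted_classes(y, "auto"))          # [1, 2]
print(targeted_classes(y, "majority"))      # [2]
print(targeted_classes(y, "not majority"))  # [0, 1]
```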
References
Examples
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import NeighbourhoodCleaningRule
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> ncr = NeighbourhoodCleaningRule()
>>> X_res, y_res = ncr.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 888, 0: 100})
Methods
fit(X, y): Check inputs and statistics of the sampler.
fit_resample(X, y): Resample the dataset.
get_feature_names_out([input_features]): Get output feature names for transformation.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
- fit(X, y)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
- Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Data array.
- y : array-like of shape (n_samples,)
Target array.
- Returns:
- self : object
Return the instance itself.
- fit_resample(X, y)
Resample the dataset.
- Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
- y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
- Returns:
- X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
- Parameters:
- input_features : array-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
- feature_names_out : ndarray of str objects
Same as input features.
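The name-resolution rules above can be sketched in plain Python. The function resolve_feature_names is a hypothetical stand-in that follows the documented behaviour; it is not the library's implementation, and it returns a list rather than an ndarray to stay dependency-free.

```python
def resolve_feature_names(n_features_in, feature_names_in=None, input_features=None):
    """Hypothetical helper mirroring the documented name-resolution rules."""
    if input_features is not None:
        names = list(input_features)
        # When feature_names_in_ is defined, input_features must match it.
        if feature_names_in is not None and names != list(feature_names_in):
            raise ValueError("input_features must match feature_names_in_")
        return names
    if feature_names_in is not None:
        return list(feature_names_in)
    # Neither is available: generate "x0", "x1", ..., "x(n_features_in - 1)".
    return [f"x{i}" for i in range(n_features_in)]


print(resolve_feature_names(3))                                 # ['x0', 'x1', 'x2']
print(resolve_feature_names(2, feature_names_in=["age", "bmi"]))  # ['age', 'bmi']
```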
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
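The <component>__<parameter> convention can be illustrated with a toy class. ToyEstimator below is a hypothetical stand-in written only to show how a double-underscore key could be routed to a nested component; it is not how the library implements set_params.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator; not part of any library."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Forward the remainder of the key to the nested component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self


pipe = ToyEstimator(sampler=ToyEstimator(n_neighbors=3), clf=ToyEstimator(C=1.0))
pipe.set_params(sampler__n_neighbors=5, clf__C=0.5)
print(pipe.params["sampler"].params["n_neighbors"])  # 5
print(pipe.params["clf"].params["C"])                # 0.5
```

Returning self from set_params allows calls to be chained, which matches the documented return value.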
Examples using imblearn.under_sampling.NeighbourhoodCleaningRule
Compare under-sampling samplers