.. _introduction:

============
Introduction
============

.. _api_imblearn:

APIs of imbalanced-learn samplers
----------------------------------

The available samplers follow the `scikit-learn API`_ using the base
estimator and incorporating a sampling functionality via the
``fit_resample`` method:

:Estimator: The base object, implements a ``fit`` method to learn from
   data::

      estimator = obj.fit(data, targets)

:Resampler: To resample a dataset, each sampler implements a
   ``fit_resample`` method::

      data_resampled, targets_resampled = obj.fit_resample(data, targets)

Imbalanced-learn samplers accept the same inputs as scikit-learn estimators:

* `data`, 2-dimensional array-like structures, such as:

  * Python's list of lists :class:`list`,
  * NumPy arrays :class:`numpy.ndarray`,
  * Pandas dataframes :class:`pandas.DataFrame`,
  * SciPy sparse matrices :class:`scipy.sparse.csr_matrix` or
    :class:`scipy.sparse.csc_matrix`;

* `targets`, 1-dimensional array-like structures, such as:

  * NumPy arrays :class:`numpy.ndarray`,
  * Pandas series :class:`pandas.Series`.

The output will be of the following types:

* `data_resampled`, 2-dimensional array-like structures, such as:

  * NumPy arrays :class:`numpy.ndarray`,
  * Pandas dataframes :class:`pandas.DataFrame`,
  * SciPy sparse matrices :class:`scipy.sparse.csr_matrix` or
    :class:`scipy.sparse.csc_matrix`;

* `targets_resampled`, 1-dimensional array-like structures, such as:

  * NumPy arrays :class:`numpy.ndarray`,
  * Pandas series :class:`pandas.Series`.

.. topic:: Pandas in/out

   Unlike scikit-learn, imbalanced-learn provides support for pandas in/out:
   passing a dataframe as input will therefore also return a dataframe as
   output.

.. topic:: Sparse input

   For sparse input the data is **converted to the Compressed Sparse Row
   representation** (see ``scipy.sparse.csr_matrix``) before being fed to
   the sampler. To avoid unnecessary memory copies, it is recommended to
   choose the CSR representation upstream.

.. _problem_statement:

Problem statement regarding imbalanced data sets
------------------------------------------------

The learning and prediction phases of machine learning algorithms can be
impacted by the issue of **imbalanced datasets**. This imbalance refers to
the difference in the number of samples across the different classes.

We demonstrate the effect of training a `Logistic Regression classifier`_
with varying levels of class balancing, obtained by adjusting the class
weights.

.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_001.png
   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
   :scale: 60
   :align: center

As expected, the decision function of the Logistic Regression classifier
varies significantly depending on how imbalanced the data is. With a greater
imbalance ratio, the decision function tends to favour the class with the
larger number of samples, usually referred to as the **majority class**.
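
To make the ``fit``/``fit_resample`` contract described above concrete, here
is a minimal sketch that rebalances a synthetic imbalanced dataset. The
choice of :class:`~imblearn.under_sampling.RandomUnderSampler` and the 9:1
class ratio are arbitrary, purely for illustration::

   from collections import Counter

   from sklearn.datasets import make_classification

   from imblearn.under_sampling import RandomUnderSampler

   # Build a two-class dataset with roughly a 9:1 class imbalance.
   data, targets = make_classification(
       n_samples=1_000, n_classes=2, weights=[0.9, 0.1], random_state=0
   )
   print(Counter(targets))  # roughly Counter({0: 900, 1: 100})

   # fit_resample learns the class statistics and returns a rebalanced
   # dataset; by default the majority class is reduced to the size of
   # the minority class.
   sampler = RandomUnderSampler(random_state=0)
   data_resampled, targets_resampled = sampler.fit_resample(data, targets)
   print(Counter(targets_resampled))  # both classes equally represented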
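
The "Pandas in/out" behaviour noted earlier can be checked with a short
sketch as well; the column names below are arbitrary::

   import pandas as pd

   from imblearn.over_sampling import RandomOverSampler

   # A tiny, deliberately imbalanced dataset: four samples of class 0
   # and a single sample of class 1.
   data = pd.DataFrame(
       {"feature_1": [0.0, 0.1, 0.2, 0.3, 5.0],
        "feature_2": [1.0, 1.1, 1.2, 1.3, 9.0]}
   )
   targets = pd.Series([0, 0, 0, 0, 1], name="label")

   sampler = RandomOverSampler(random_state=0)
   data_resampled, targets_resampled = sampler.fit_resample(data, targets)

   # Types and column names are preserved: a dataframe comes back as a
   # dataframe, and a series as a series.
   print(type(data_resampled))          # pandas.core.frame.DataFrame
   print(list(data_resampled.columns))  # ['feature_1', 'feature_2']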
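
Similarly, a sparse matrix can be passed directly; per the "Sparse input"
note above it is handled in CSR form, and since sparse matrices are among
the supported output types, the resampled data stays sparse::

   from scipy import sparse

   from sklearn.datasets import make_classification

   from imblearn.under_sampling import RandomUnderSampler

   data, targets = make_classification(
       n_samples=1_000, weights=[0.9, 0.1], random_state=0
   )

   # Choosing CSR upstream avoids an internal conversion (and memory
   # copy) from other sparse formats such as CSC.
   data_sparse = sparse.csr_matrix(data)

   sampler = RandomUnderSampler(random_state=0)
   data_resampled, _ = sampler.fit_resample(data_sparse, targets)
   print(sparse.issparse(data_resampled))  # True: sparse in, sparse out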