.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_topic_classication.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_topic_classication.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_topic_classication.py:


=================================================
Example of topic classification in text documents
=================================================

This example shows how to balance text data before training a classifier.

Note that the data in this example are only slightly imbalanced; in other
data sets, the imbalance ratio can be much more pronounced.

.. GENERATED FROM PYTHON SOURCE LINES 11-15

.. code-block:: Python


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT


.. GENERATED FROM PYTHON SOURCE LINES 16-18

.. code-block:: Python

    print(__doc__)


.. GENERATED FROM PYTHON SOURCE LINES 19-27

Setting the data set
--------------------

We use a subset of the 20 newsgroups data set restricted to 4 topics. The
scikit-learn loader provides the data already split into a training and a
testing set.

Note that class \#3 is the minority class and has roughly a third fewer
samples than the majority class.

.. GENERATED FROM PYTHON SOURCE LINES 29-46

.. code-block:: Python

    from sklearn.datasets import fetch_20newsgroups

    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]
    newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
    newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)

    X_train = newsgroups_train.data
    X_test = newsgroups_test.data

    y_train = newsgroups_train.target
    y_test = newsgroups_test.target


.. GENERATED FROM PYTHON SOURCE LINES 47-52

.. code-block:: Python

    from collections import Counter

    print(f"Training class distributions summary: {Counter(y_train)}")
    print(f"Test class distributions summary: {Counter(y_test)}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Training class distributions summary: Counter({np.int64(2): 593, np.int64(1): 584, np.int64(0): 480, np.int64(3): 377})
    Test class distributions summary: Counter({np.int64(2): 394, np.int64(1): 389, np.int64(0): 319, np.int64(3): 251})
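
The imbalance can be quantified directly from these counts. The following is
a minimal sketch that hard-codes the training counts printed above; the names
``train_counts`` and ``imbalance_ratio`` are illustrative and not part of the
original example.

.. code-block:: Python

    from collections import Counter

    # Training class counts copied from the output above.
    train_counts = Counter({2: 593, 1: 584, 0: 480, 3: 377})

    minority_class, minority_count = min(train_counts.items(), key=lambda kv: kv[1])
    majority_class, majority_count = max(train_counts.items(), key=lambda kv: kv[1])

    # Ratio of minority to majority samples (1.0 would be perfectly balanced).
    imbalance_ratio = minority_count / majority_count
    print(f"class #{minority_class} / class #{majority_class}: {imbalance_ratio:.2f}")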


.. GENERATED FROM PYTHON SOURCE LINES 53-62

The usual scikit-learn pipeline
-------------------------------

A typical scikit-learn approach is to build a pipeline in which a TF-IDF
vectorizer feeds a multinomial naive Bayes classifier. A classification
report summarizes the results on the testing set.

As expected, the recall of class \#3 is low, mainly due to the class
imbalance.

.. GENERATED FROM PYTHON SOURCE LINES 64-72

.. code-block:: Python

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)


.. GENERATED FROM PYTHON SOURCE LINES 73-77

.. code-block:: Python

    from imblearn.metrics import classification_report_imbalanced

    print(classification_report_imbalanced(y_test, y_pred))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                       pre       rec       spe        f1       geo       iba       sup

              0       0.67      0.94      0.86      0.79      0.90      0.82       319
              1       0.96      0.92      0.99      0.94      0.95      0.90       389
              2       0.87      0.98      0.94      0.92      0.96      0.92       394
              3       0.97      0.36      1.00      0.52      0.60      0.33       251

    avg / total       0.87      0.84      0.94      0.82      0.88      0.78      1353
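
The column headers abbreviate precision, recall, specificity, F1 score,
geometric mean, index balanced accuracy, and support. As a rough illustration
of how recall, specificity, and their geometric mean relate for a single class
treated one-vs-rest, here is a sketch in plain Python on made-up labels
(``y_true_toy`` and ``y_pred_toy`` are hypothetical, not the newsgroups data).

.. code-block:: Python

    from math import sqrt

    # Hypothetical labels, used only to illustrate the definitions.
    y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]
    target = 0  # class evaluated one-vs-rest

    pairs = list(zip(y_true_toy, y_pred_toy))
    tp = sum(t == target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)

    recall = tp / (tp + fn)  # share of the target class that is found
    specificity = tn / (tn + fp)  # share of the other classes correctly rejected
    geo = sqrt(recall * specificity)  # geometric mean of the two
    print(f"rec={recall:.2f} spe={specificity:.2f} geo={geo:.2f}")

Applied to the values reported above for class \#3, the square root of
0.36 * 1.00 is 0.60, which matches the ``geo`` column.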


.. GENERATED FROM PYTHON SOURCE LINES 78-89

Balancing the classes before classification
-------------------------------------------

To improve the prediction of class \#3, it can be useful to balance the
classes before training the naive Bayes classifier. We therefore use a
:class:`~imblearn.under_sampling.RandomUnderSampler` to equalize the number
of samples in all classes before training.

Note that we use the :func:`~imblearn.pipeline.make_pipeline` function
implemented in imbalanced-learn, which is needed to handle samplers properly.
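
To make the idea concrete, here is a minimal sketch of random under-sampling
in plain Python. It is not imbalanced-learn's implementation; the
``undersample`` helper and the toy data are hypothetical and only illustrate
the principle of keeping as many samples per class as the rarest class has.

.. code-block:: Python

    import random
    from collections import Counter, defaultdict


    def undersample(X, y, seed=0):
        """Randomly drop samples so every class keeps as many as the rarest one."""
        rng = random.Random(seed)
        indices_by_class = defaultdict(list)
        for index, label in enumerate(y):
            indices_by_class[label].append(index)

        # Size of the smallest class: every class is reduced to this count.
        n_keep = min(len(indices) for indices in indices_by_class.values())

        kept = []
        for indices in indices_by_class.values():
            kept.extend(rng.sample(indices, n_keep))
        kept.sort()  # preserve the original sample order
        return [X[i] for i in kept], [y[i] for i in kept]


    # Toy data: 6, 4, and 2 documents for classes 0, 1, and 2.
    X_toy = [f"doc {i}" for i in range(12)]
    y_toy = [0] * 6 + [1] * 4 + [2] * 2
    X_res, y_res = undersample(X_toy, y_toy)
    print(Counter(y_res))

In the imbalanced-learn pipeline, the sampler is applied only during ``fit``;
at prediction time the test documents pass through without being resampled.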

.. GENERATED FROM PYTHON SOURCE LINES 89-92

.. code-block:: Python


    from imblearn.pipeline import make_pipeline as make_pipeline_imb


.. GENERATED FROM PYTHON SOURCE LINES 93-100

.. code-block:: Python

    from imblearn.under_sampling import RandomUnderSampler

    model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)


.. GENERATED FROM PYTHON SOURCE LINES 101-105

Although the averaged results are almost identical, resampling corrects the
poor recall of class \#3 (0.73 instead of 0.36 in the reports shown here) at
the cost of degrading some metrics for the other classes. Overall, the
results are slightly better.

.. GENERATED FROM PYTHON SOURCE LINES 107-108

.. code-block:: Python

    print(classification_report_imbalanced(y_test, y_pred))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                       pre       rec       spe        f1       geo       iba       sup

              0       0.68      0.91      0.87      0.78      0.89      0.79       319
              1       0.98      0.84      0.99      0.90      0.91      0.82       389
              2       0.95      0.88      0.98      0.91      0.93      0.85       394
              3       0.80      0.73      0.96      0.76      0.83      0.68       251

    avg / total       0.87      0.85      0.95      0.85      0.90      0.80      1353
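
The averaged row appears to weight each class by its support, which hides part
of the change. The following quick check uses the per-class recalls copied
from the two reports above; the values are rounded, so the results are
approximate, and the helper names are illustrative.

.. code-block:: Python

    # Per-class recall and support copied from the two reports above.
    recall_before = [0.94, 0.92, 0.98, 0.36]
    recall_after = [0.91, 0.84, 0.88, 0.73]
    support = [319, 389, 394, 251]


    def macro_average(values):
        """Unweighted mean: every class counts equally."""
        return sum(values) / len(values)


    def weighted_average(values, weights):
        """Mean weighted by the number of test samples in each class."""
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)


    for name, recalls in [("before", recall_before), ("after", recall_after)]:
        print(
            f"{name}: macro={macro_average(recalls):.2f} "
            f"weighted={weighted_average(recalls, support):.2f}"
        )

The unweighted (macro) average recall moves from about 0.80 to about 0.84,
a larger gain than the support-weighted average suggests.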


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 11.572 seconds)

**Estimated memory usage:**  313 MB


.. _sphx_glr_download_auto_examples_applications_plot_topic_classication.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_topic_classication.ipynb <plot_topic_classication.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_topic_classication.py <plot_topic_classication.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_topic_classication.zip <plot_topic_classication.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_