Example of topic classification in text documents
=================================================

This example shows how to balance text data before training a classifier.

Note that in this example the data are only slightly imbalanced; in other data sets, the imbalance ratio can be much more significant.
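
To make the notion of an imbalance ratio concrete, the following minimal sketch computes it from the training class counts reported later in this example. The helper name ``imbalance_ratio`` is hypothetical and is not part of scikit-learn or imbalanced-learn; it simply divides the size of the majority class by the size of the minority class.

.. code-block:: Python

    from collections import Counter


    def imbalance_ratio(class_counts):
        """Return the majority-class size divided by the minority-class size."""
        counts = class_counts.values()
        return max(counts) / min(counts)


    # Training class counts as printed later in this example.
    train_counts = Counter({2: 593, 1: 584, 0: 480, 3: 377})
    ratio = imbalance_ratio(train_counts)
    print(f"Imbalance ratio (majority / minority): {ratio:.2f}")

A ratio close to 1 indicates balanced classes. Here the majority class is about 1.57 times larger than the minority class.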

.. code-block:: Python

    # Authors: Guillaume Lemaitre
    # License: MIT

.. GENERATED FROM PYTHON SOURCE LINES 16-18

.. code-block:: Python

    print(__doc__)

.. GENERATED FROM PYTHON SOURCE LINES 19-27

Setting the data set
--------------------

We use a part of the 20 newsgroups data set by loading 4 topics. Using the
scikit-learn loader, the data are split into a training and a testing set.

Note that class \#3 is the minority class: in the training set it has about
36% fewer samples than the majority class (377 versus 593).

.. GENERATED FROM PYTHON SOURCE LINES 29-46

.. code-block:: Python

    from sklearn.datasets import fetch_20newsgroups

    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]
    newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
    newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)

    X_train = newsgroups_train.data
    X_test = newsgroups_test.data

    y_train = newsgroups_train.target
    y_test = newsgroups_test.target

.. GENERATED FROM PYTHON SOURCE LINES 47-52

.. code-block:: Python

    from collections import Counter

    print(f"Training class distributions summary: {Counter(y_train)}")
    print(f"Test class distributions summary: {Counter(y_test)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Training class distributions summary: Counter({2: 593, 1: 584, 0: 480, 3: 377})
    Test class distributions summary: Counter({2: 394, 1: 389, 0: 319, 3: 251})

.. GENERATED FROM PYTHON SOURCE LINES 53-62

The usual scikit-learn pipeline
-------------------------------

A typical scikit-learn pipeline combines a TF-IDF vectorizer with a
multinomial naive Bayes classifier. A classification report summarizes the
results on the testing set.

As expected, the recall of class \#3 is low, mainly because of the class
imbalance.

.. GENERATED FROM PYTHON SOURCE LINES 64-72

.. code-block:: Python

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

.. GENERATED FROM PYTHON SOURCE LINES 73-77

.. code-block:: Python

    from imblearn.metrics import classification_report_imbalanced

    print(classification_report_imbalanced(y_test, y_pred))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                       pre       rec       spe        f1       geo       iba       sup

              0       0.67      0.94      0.86      0.79      0.90      0.82       319
              1       0.96      0.92      0.99      0.94      0.95      0.90       389
              2       0.87      0.98      0.94      0.92      0.96      0.92       394
              3       0.97      0.36      1.00      0.52      0.60      0.33       251

    avg / total       0.87      0.84      0.94      0.82      0.88      0.78      1353

.. GENERATED FROM PYTHON SOURCE LINES 78-89

Balancing the class before classification
-----------------------------------------

To improve the prediction of class \#3, it can help to balance the classes
before training the naive Bayes classifier. We therefore use a
:class:`~imblearn.under_sampling.RandomUnderSampler` to equalize the number of
samples in all classes before training.

Note also that we use the :func:`~imblearn.pipeline.make_pipeline` function
implemented in imbalanced-learn so that the samplers are handled properly.

.. GENERATED FROM PYTHON SOURCE LINES 89-92

.. code-block:: Python

    from imblearn.pipeline import make_pipeline as make_pipeline_imb

.. GENERATED FROM PYTHON SOURCE LINES 93-100

.. code-block:: Python

    from imblearn.under_sampling import RandomUnderSampler

    model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

.. GENERATED FROM PYTHON SOURCE LINES 101-105

Although the averaged results are close, resampling corrects the poor recall
of class \#3 (from 0.36 to 0.74), at the cost of lower scores on some metrics
for the other classes. Overall, the results are slightly better.

.. code-block:: Python

    print(classification_report_imbalanced(y_test, y_pred))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                       pre       rec       spe        f1       geo       iba       sup

              0       0.71      0.90      0.89      0.79      0.89      0.80       319
              1       0.97      0.85      0.99      0.91      0.92      0.83       389
              2       0.94      0.90      0.98      0.92      0.94      0.87       394
              3       0.79      0.74      0.95      0.76      0.84      0.69       251

    avg / total       0.87      0.85      0.96      0.86      0.90      0.81      1353
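
The effect of random under-sampling on the class counts can be sketched without any third-party dependency. The function below is a simplified illustration and not the implementation of ``RandomUnderSampler``: the name ``random_undersample`` and its ``seed`` parameter are assumptions made for this sketch. For every class, it keeps as many randomly chosen samples as the minority class contains.

.. code-block:: Python

    import random
    from collections import Counter, defaultdict


    def random_undersample(X, y, seed=0):
        """Randomly drop samples so that every class matches the minority size."""
        rng = random.Random(seed)

        # Group sample indices by class label.
        indices_per_class = defaultdict(list)
        for index, label in enumerate(y):
            indices_per_class[label].append(index)

        n_minority = min(len(indices) for indices in indices_per_class.values())

        kept = []
        for indices in indices_per_class.values():
            # Sample without replacement within each class.
            kept.extend(rng.sample(indices, n_minority))
        kept.sort()  # preserve the original ordering of the retained samples

        return [X[i] for i in kept], [y[i] for i in kept]


    # Toy labels reproducing the training class counts of this example.
    y_toy = [2] * 593 + [1] * 584 + [0] * 480 + [3] * 377
    X_toy = [f"document {i}" for i in range(len(y_toy))]

    X_res, y_res = random_undersample(X_toy, y_toy)
    print(Counter(y_res))

With these counts, every class is reduced to 377 samples, so 1508 of the 2034 training documents are kept. In the imbalanced-learn pipeline used in this example, the sampler is applied only during ``fit``; predictions on the testing set use all test documents.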