.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_impact_imbalanced_classes.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_impact_imbalanced_classes.py:

=============================================================
Fitting models on imbalanced datasets and how to fight bias
=============================================================

This example illustrates the problem induced by learning on datasets having
imbalanced classes. Subsequently, we compare different approaches alleviating
these negative effects.

.. GENERATED FROM PYTHON SOURCE LINES 10-14

.. code-block:: Python

    # Authors: Guillaume Lemaitre
    # License: MIT

.. GENERATED FROM PYTHON SOURCE LINES 15-17

.. code-block:: Python

    print(__doc__)

.. GENERATED FROM PYTHON SOURCE LINES 18-27

Problem definition
------------------

We are dropping the following features:

- "fnlwgt": this feature was created while studying the "adult" dataset and is
  not acquired during the survey, so we do not use it.
- "education-num": it encodes the same information as "education", so we keep
  only one of these two features.

.. GENERATED FROM PYTHON SOURCE LINES 29-34

.. code-block:: Python

    from sklearn.datasets import fetch_openml

    df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
    df = df.drop(columns=["fnlwgt", "education-num"])

.. GENERATED FROM PYTHON SOURCE LINES 35-36

The "adult" dataset has a class ratio of about 3:1.

.. GENERATED FROM PYTHON SOURCE LINES 38-41

.. code-block:: Python

    classes_count = y.value_counts()
    classes_count

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    class
    <=50K    37155
    >50K     11687
    Name: count, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 42-44

This dataset is only slightly imbalanced. To better highlight the effect of
learning from an imbalanced dataset, we will increase its ratio to 30:1.

.. GENERATED FROM PYTHON SOURCE LINES 46-56

.. code-block:: Python

    from imblearn.datasets import make_imbalance

    ratio = 30
    df_res, y_res = make_imbalance(
        df,
        y,
        sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
    )
    y_res.value_counts()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    class
    <=50K    37155
    >50K      1238
    Name: count, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 57-62

We will perform a cross-validation evaluation to get an estimate of the test
score.

As a baseline, we could use a classifier which always predicts the majority
class, regardless of the features provided.

.. GENERATED FROM PYTHON SOURCE LINES 62-65

.. code-block:: Python

    from sklearn.dummy import DummyClassifier

.. GENERATED FROM PYTHON SOURCE LINES 66-73

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    dummy_clf = DummyClassifier(strategy="most_frequent")
    scoring = ["accuracy", "balanced_accuracy"]
    cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
    print(f"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy score of a dummy classifier: 0.968

.. GENERATED FROM PYTHON SOURCE LINES 74-76

Instead of the accuracy, we can use the balanced accuracy, which takes the
class imbalance into account.

.. GENERATED FROM PYTHON SOURCE LINES 78-83
.. code-block:: Python

    print(
        f"Balanced accuracy score of a dummy classifier: "
        f"{cv_result['test_balanced_accuracy'].mean():.3f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Balanced accuracy score of a dummy classifier: 0.500

.. GENERATED FROM PYTHON SOURCE LINES 84-88

Strategies to learn from an imbalanced dataset
----------------------------------------------

We will use a dictionary and a list to accumulate the results of our
experiments and display them as a pandas dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 90-93

.. code-block:: Python

    index = []
    scores = {"Accuracy": [], "Balanced accuracy": []}

.. GENERATED FROM PYTHON SOURCE LINES 94-99

Dummy baseline
..............

Before training a real machine learning model, we can store the results
obtained with our :class:`~sklearn.dummy.DummyClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 101-111

.. code-block:: Python

    import pandas as pd

    index += ["Dummy classifier"]
    cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                      Accuracy  Balanced accuracy
    Dummy classifier  0.967755                0.5
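
As a side note, the balanced accuracy used throughout this example is simply
the recall averaged over the classes, which is why a classifier always
predicting the majority class scores 0.5. A minimal sketch on hypothetical toy
labels (not part of the benchmark above) illustrates this equivalence:

.. code-block:: Python

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score, recall_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    y_pred = np.zeros_like(y_true)  # always predict the majority class

    # Both lines print 0.5: perfect recall on class 0, zero recall on class 1.
    print(balanced_accuracy_score(y_true, y_pred))
    print(recall_score(y_true, y_pred, average="macro"))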


.. GENERATED FROM PYTHON SOURCE LINES 112-122

Linear classifier baseline
..........................

We will create a machine learning pipeline using a
:class:`~sklearn.linear_model.LogisticRegression` classifier. To do so, we
need to one-hot encode the categorical columns and standardize the numerical
columns before injecting the data into the
:class:`~sklearn.linear_model.LogisticRegression` classifier.

First, we define our numerical and categorical pipelines.

.. GENERATED FROM PYTHON SOURCE LINES 124-136

.. code-block:: Python

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    num_pipe = make_pipeline(
        StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
    )
    cat_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder(handle_unknown="ignore"),
    )

.. GENERATED FROM PYTHON SOURCE LINES 137-140

Then, we can create a preprocessor which dispatches the categorical columns to
the categorical pipeline and the numerical columns to the numerical pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 142-151

.. code-block:: Python

    from sklearn.compose import make_column_selector as selector
    from sklearn.compose import make_column_transformer

    preprocessor_linear = make_column_transformer(
        (num_pipe, selector(dtype_include="number")),
        (cat_pipe, selector(dtype_include="category")),
        n_jobs=2,
    )

.. GENERATED FROM PYTHON SOURCE LINES 152-155

Finally, we connect our preprocessor with our
:class:`~sklearn.linear_model.LogisticRegression`. We can then evaluate our
model.

.. GENERATED FROM PYTHON SOURCE LINES 157-161

.. code-block:: Python

    from sklearn.linear_model import LogisticRegression

    lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))

.. GENERATED FROM PYTHON SOURCE LINES 162-170

.. code-block:: Python

    index += ["Logistic regression"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                         Accuracy  Balanced accuracy
    Dummy classifier     0.967755           0.500000
    Logistic regression  0.970333           0.568094


.. GENERATED FROM PYTHON SOURCE LINES 171-178

We can see that our linear model learns slightly better than the dummy
baseline. However, it is still impacted by the class imbalance.

We can verify that something similar happens with a tree-based model such as
:class:`~sklearn.ensemble.RandomForestClassifier`. With this type of
classifier, we do not need to scale the numerical data, and we only need to
ordinal encode the categorical data.

.. GENERATED FROM PYTHON SOURCE LINES 178-181

.. code-block:: Python

    from sklearn.ensemble import RandomForestClassifier

.. GENERATED FROM PYTHON SOURCE LINES 182-200

.. code-block:: Python

    from sklearn.preprocessing import OrdinalEncoder

    num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
    cat_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    )

    preprocessor_tree = make_column_transformer(
        (num_pipe, selector(dtype_include="number")),
        (cat_pipe, selector(dtype_include="category")),
        n_jobs=2,
    )

    rf_clf = make_pipeline(
        preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)
    )

.. GENERATED FROM PYTHON SOURCE LINES 201-209

.. code-block:: Python

    index += ["Random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                         Accuracy  Balanced accuracy
    Dummy classifier     0.967755           0.500000
    Logistic regression  0.970333           0.568094
    Random forest        0.970281           0.630934


.. GENERATED FROM PYTHON SOURCE LINES 210-224

The :class:`~sklearn.ensemble.RandomForestClassifier` is also affected by the
class imbalance, although slightly less than the linear model. Now, we will
present different approaches to improve the performance of these two models.

Use `class_weight`
..................

Most of the models in `scikit-learn` have a parameter `class_weight`. This
parameter affects the computation of the loss in linear models, or the
splitting criterion in tree-based models, so that misclassifications of the
minority and of the majority class are penalized differently. We can set
`class_weight="balanced"` such that the weight applied to each class is
inversely proportional to its frequency. We test this parametrization on both
the linear and the tree-based model.

.. GENERATED FROM PYTHON SOURCE LINES 226-236

.. code-block:: Python

    lr_clf.set_params(logisticregression__class_weight="balanced")

    index += ["Logistic regression with balanced class weights"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
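
For reference, the weights implied by `class_weight="balanced"` can be
inspected explicitly: each class receives `n_samples / (n_classes *
class_count)`. A minimal sketch, assuming the `y_res` defined above:

.. code-block:: Python

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    classes = np.unique(y_res)
    weights = compute_class_weight("balanced", classes=classes, y=y_res)
    # The minority class receives a weight roughly `ratio` times larger than
    # the majority class.
    print(dict(zip(classes, weights)))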


.. GENERATED FROM PYTHON SOURCE LINES 237-247

.. code-block:: Python

    rf_clf.set_params(randomforestclassifier__class_weight="balanced")

    index += ["Random forest with balanced class weights"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
    Random forest with balanced class weights        0.964212           0.632087
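
A closely related option for random forests, not evaluated in this example, is
`class_weight="balanced_subsample"`, which recomputes the balanced weights on
each bootstrap sample rather than on the full training set. A minimal sketch of
such a pipeline (the `rf_subsample_clf` name is only illustrative):

.. code-block:: Python

    rf_subsample_clf = make_pipeline(
        preprocessor_tree,
        RandomForestClassifier(
            class_weight="balanced_subsample", random_state=42, n_jobs=2
        ),
    )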


.. GENERATED FROM PYTHON SOURCE LINES 248-260

We can see that using `class_weight` was really effective for the linear
model, alleviating the issue of learning from imbalanced classes. However, the
:class:`~sklearn.ensemble.RandomForestClassifier` is still biased toward the
majority class, mainly because its splitting criterion is not well suited to
fighting the class imbalance.

Resample the training set during learning
.........................................

Another approach is to resample the training set, either by under-sampling or
by over-sampling some of the samples. `imbalanced-learn` provides samplers for
this purpose.

.. GENERATED FROM PYTHON SOURCE LINES 262-271

.. code-block:: Python

    from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
    from imblearn.under_sampling import RandomUnderSampler

    lr_clf = make_pipeline_with_sampler(
        preprocessor_linear,
        RandomUnderSampler(random_state=42),
        LogisticRegression(max_iter=1000),
    )

.. GENERATED FROM PYTHON SOURCE LINES 272-280

.. code-block:: Python

    index += ["Under-sampling + Logistic regression"]
    cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
    Random forest with balanced class weights        0.964212           0.632087
    Under-sampling + Logistic regression              0.801032           0.825345
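
To see what the sampler actually does, we can apply it outside of the
pipeline; a minimal sketch, assuming the `df_res` and `y_res` defined above
(:class:`~imblearn.under_sampling.RandomUnderSampler` only selects rows, so it
can be applied to the raw dataframe):

.. code-block:: Python

    from imblearn.under_sampling import RandomUnderSampler

    sampler = RandomUnderSampler(random_state=42)
    _, y_resampled = sampler.fit_resample(df_res, y_res)
    # The majority class is down-sampled to the size of the minority class.
    print(y_resampled.value_counts())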


.. GENERATED FROM PYTHON SOURCE LINES 281-287

.. code-block:: Python

    rf_clf = make_pipeline_with_sampler(
        preprocessor_tree,
        RandomUnderSampler(random_state=42),
        RandomForestClassifier(random_state=42, n_jobs=2),
    )

.. GENERATED FROM PYTHON SOURCE LINES 288-296

.. code-block:: Python

    index += ["Under-sampling + Random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
    Random forest with balanced class weights        0.964212           0.632087
    Under-sampling + Logistic regression              0.801032           0.825345
    Under-sampling + Random forest                    0.799651           0.810557
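
The same pipeline layout accepts any other sampler from `imbalanced-learn`. As
an illustration only, and not evaluated in this example, the under-sampler
could be swapped for a random over-sampler:

.. code-block:: Python

    from imblearn.over_sampling import RandomOverSampler

    # Hypothetical variant, not part of the benchmark above.
    lr_over_clf = make_pipeline_with_sampler(
        preprocessor_linear,
        RandomOverSampler(random_state=42),
        LogisticRegression(max_iter=1000),
    )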


.. GENERATED FROM PYTHON SOURCE LINES 297-316

Applying a random under-sampler before training the linear model or the random
forest prevents the model from focusing on the majority class, at the cost of
making more mistakes on samples from the majority class (i.e. a decreased
accuracy).

We could apply any type of sampler and find which one works best on the
current dataset.

Instead, we will present another approach: classifiers which apply the
resampling internally.

Use of specific balanced algorithms from imbalanced-learn
..........................................................

We already showed that random under-sampling can be effective with tree-based
models. However, instead of under-sampling the dataset once, one can
under-sample the original dataset before taking each bootstrap sample. This is
the principle behind
:class:`~imblearn.ensemble.BalancedRandomForestClassifier` and
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 318-331

.. code-block:: Python

    from imblearn.ensemble import BalancedRandomForestClassifier

    rf_clf = make_pipeline(
        preprocessor_tree,
        BalancedRandomForestClassifier(
            sampling_strategy="all",
            replacement=True,
            bootstrap=False,
            random_state=42,
            n_jobs=2,
        ),
    )

.. GENERATED FROM PYTHON SOURCE LINES 332-340

.. code-block:: Python

    index += ["Balanced random forest"]
    cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
    Random forest with balanced class weights        0.964212           0.632087
    Under-sampling + Logistic regression              0.801032           0.825345
    Under-sampling + Random forest                    0.799651           0.810557
    Balanced random forest                            0.853671           0.808419


.. GENERATED FROM PYTHON SOURCE LINES 341-345

The performance of the
:class:`~imblearn.ensemble.BalancedRandomForestClassifier` is better than
applying a single random under-sampling step. We will now use a
gradient-boosting classifier within a
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.

.. GENERATED FROM PYTHON SOURCE LINES 345-368

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    from imblearn.ensemble import BalancedBaggingClassifier

    bag_clf = make_pipeline(
        preprocessor_tree,
        BalancedBaggingClassifier(
            estimator=HistGradientBoostingClassifier(random_state=42),
            n_estimators=10,
            random_state=42,
            n_jobs=2,
        ),
    )

    index += ["Balanced bag of histogram gradient boosting"]
    cv_result = cross_validate(bag_clf, df_res, y_res, scoring=scoring)
    scores["Accuracy"].append(cv_result["test_accuracy"].mean())
    scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

    df_scores = pd.DataFrame(scores, index=index)
    df_scores

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                      Accuracy  Balanced accuracy
    Dummy classifier                                  0.967755           0.500000
    Logistic regression                               0.970333           0.568094
    Random forest                                     0.970281           0.630934
    Logistic regression with balanced class weights  0.808142           0.823958
    Random forest with balanced class weights        0.964212           0.632087
    Under-sampling + Logistic regression              0.801032           0.825345
    Under-sampling + Random forest                    0.799651           0.810557
    Balanced random forest                            0.853671           0.808419
    Balanced bag of histogram gradient boosting       0.837080           0.829523
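
As an optional follow-up, and not part of the cross-validated comparison
above, one can fit the balanced bagging pipeline on a stratified split and
inspect the per-class metrics; a minimal sketch:

.. code-block:: Python

    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df_res, y_res, stratify=y_res, random_state=42
    )
    bag_clf.fit(X_train, y_train)
    # Per-class precision and recall give a more detailed view than a single score.
    print(classification_report(y_test, bag_clf.predict(X_test)))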


.. GENERATED FROM PYTHON SOURCE LINES 369-372

This last approach is the most effective. The different under-sampled training
sets bring some diversity to the individual GBDT models, so that the ensemble
does not focus only on a portion of the majority class.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 40.331 seconds)

**Estimated memory usage:** 87 MB


.. _sphx_glr_download_auto_examples_applications_plot_impact_imbalanced_classes.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_impact_imbalanced_classes.ipynb <plot_impact_imbalanced_classes.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_impact_imbalanced_classes.py <plot_impact_imbalanced_classes.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_