Problem Statement

Business Context

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries

In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)
from sklearn import metrics

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

pd.set_option("display.float_format", lambda x: "%.3f" % x)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

pd.set_option("display.float_format", lambda x: "%.3f" % x)

import warnings

warnings.filterwarnings("ignore")

Loading the dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
df = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Train.csv.csv') 
df_test = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Test.csv.csv') 

Data Overview

  • Observations
  • Sanity checks
In [91]:
df.shape ## dimensions of the train data
Out[91]:
(20000, 41)
In [5]:
df_test.shape ##  dimensions of the test data
Out[5]:
(5000, 41)
In [6]:
## Creating copy of training data

data = df.copy()
In [7]:
## Creating copy of test data

data_test = df_test.copy()
In [8]:
data.head() ##  top 5 rows of training data
Out[8]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.465 -4.679 3.102 0.506 -0.221 -2.033 -2.911 0.051 -1.522 3.762 -5.715 0.736 0.981 1.418 -3.376 -3.047 0.306 2.914 2.270 4.395 -2.388 0.646 -1.191 3.133 0.665 -2.511 -0.037 0.726 -3.982 -1.073 1.667 3.060 -1.690 2.846 2.235 6.667 0.444 -2.369 2.951 -3.480 0
1 3.366 3.653 0.910 -1.368 0.332 2.359 0.733 -4.332 0.566 -0.101 1.914 -0.951 -1.255 -2.707 0.193 -4.769 -2.205 0.908 0.757 -5.834 -3.065 1.597 -1.757 1.766 -0.267 3.625 1.500 -0.586 0.783 -0.201 0.025 -1.795 3.033 -2.468 1.895 -2.298 -1.731 5.909 -0.386 0.616 0
2 -3.832 -5.824 0.634 -2.419 -1.774 1.017 -2.099 -3.173 -2.082 5.393 -0.771 1.107 1.144 0.943 -3.164 -4.248 -4.039 3.689 3.311 1.059 -2.143 1.650 -1.661 1.680 -0.451 -4.551 3.739 1.134 -2.034 0.841 -1.600 -0.257 0.804 4.086 2.292 5.361 0.352 2.940 3.839 -4.309 0
3 1.618 1.888 7.046 -1.147 0.083 -1.530 0.207 -2.494 0.345 2.119 -3.053 0.460 2.705 -0.636 -0.454 -3.174 -3.404 -1.282 1.582 -1.952 -3.517 -1.206 -5.628 -1.818 2.124 5.295 4.748 -2.309 -3.963 -6.029 4.949 -3.584 -2.577 1.364 0.623 5.550 -1.527 0.139 3.101 -1.277 0
4 -0.111 3.872 -3.758 -2.983 3.793 0.545 0.205 4.849 -1.855 -6.220 1.998 4.724 0.709 -1.989 -2.633 4.184 2.245 3.734 -6.313 -5.380 -0.887 2.062 9.446 4.490 -3.945 4.582 -8.780 -3.383 5.107 6.788 2.044 8.266 6.629 -10.069 1.223 -3.230 1.687 -2.164 -3.645 6.510 0
In [9]:
data_test.tail() ##  last 5 rows of test data
Out[9]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
4995 -5.120 1.635 1.251 4.036 3.291 -2.932 -1.329 1.754 -2.985 1.249 -6.878 3.715 -2.512 -1.395 -2.554 -2.197 4.772 2.403 3.792 0.487 -2.028 1.778 3.668 11.375 -1.977 2.252 -7.319 1.907 -3.734 -0.012 2.120 9.979 0.063 0.217 3.036 2.109 -0.557 1.939 0.513 -2.694 0
4996 -5.172 1.172 1.579 1.220 2.530 -0.669 -2.618 -2.001 0.634 -0.579 -3.671 0.460 3.321 -1.075 -7.113 -4.356 -0.001 3.698 -0.846 -0.222 -3.645 0.736 0.926 3.278 -2.277 4.458 -4.543 -1.348 -1.779 0.352 -0.214 4.424 2.604 -2.152 0.917 2.157 0.467 0.470 2.197 -2.377 0
4997 -1.114 -0.404 -1.765 -5.879 3.572 3.711 -2.483 -0.308 -0.922 -2.999 -0.112 -1.977 -1.623 -0.945 -2.735 -0.813 0.610 8.149 -9.199 -3.872 -0.296 1.468 2.884 2.792 -1.136 1.198 -4.342 -2.869 4.124 4.197 3.471 3.792 7.482 -10.061 -0.387 1.849 1.818 -1.246 -1.261 7.475 0
4998 -1.703 0.615 6.221 -0.104 0.956 -3.279 -1.634 -0.104 1.388 -1.066 -7.970 2.262 3.134 -0.486 -3.498 -4.562 3.136 2.536 -0.792 4.398 -4.073 -0.038 -2.371 -1.542 2.908 3.215 -0.169 -1.541 -4.724 -5.525 1.668 -4.100 -5.949 0.550 -1.574 6.824 2.139 -4.036 3.436 0.579 0
4999 -0.604 0.960 -0.721 8.230 -1.816 -2.276 -2.575 -1.041 4.130 -2.731 -3.292 -1.674 0.465 -1.646 -5.263 -7.988 6.480 0.226 4.963 6.752 -6.306 3.271 1.897 3.271 -0.637 -0.925 -6.759 2.990 -0.814 3.499 -8.435 2.370 -1.062 0.791 4.952 -7.441 -0.070 -0.918 -2.291 -5.363 0
In [10]:
## data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

There are 41 numeric (float and int) columns in the data.

In [11]:
data.duplicated().sum() ## duplicate entries in the data
Out[11]:
0

There are no duplicate values in the dataset.

In [12]:
data.isnull().sum() ## missing entries in the train data
Out[12]:
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64
In [13]:
data_test.isnull().sum() ## missing entries in the test data
Out[13]:
V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

There are missing values in train and test data.

In [14]:
data.describe(include="all") ## statitical summary of the train data
Out[14]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
count 19982.000 19982.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000 20000.000
mean -0.272 0.440 2.485 -0.083 -0.054 -0.995 -0.879 -0.548 -0.017 -0.013 -1.895 1.605 1.580 -0.951 -2.415 -2.925 -0.134 1.189 1.182 0.024 -3.611 0.952 -0.366 1.134 -0.002 1.874 -0.612 -0.883 -0.986 -0.016 0.487 0.304 0.050 -0.463 2.230 1.515 0.011 -0.344 0.891 -0.876 0.056
std 3.442 3.151 3.389 3.432 2.105 2.041 1.762 3.296 2.161 2.193 3.124 2.930 2.875 1.790 3.355 4.222 3.345 2.592 3.397 3.669 3.568 1.652 4.032 3.912 2.017 3.435 4.369 1.918 2.684 3.005 3.461 5.500 3.575 3.184 2.937 3.801 1.788 3.948 1.753 3.012 0.229
min -11.876 -12.320 -10.708 -15.082 -8.603 -10.227 -7.950 -15.658 -8.596 -9.854 -14.832 -12.948 -13.228 -7.739 -16.417 -20.374 -14.091 -11.644 -13.492 -13.923 -17.956 -10.122 -14.866 -16.387 -8.228 -11.834 -14.905 -9.269 -12.579 -14.796 -13.723 -19.877 -16.898 -17.985 -15.350 -14.833 -5.478 -17.375 -6.439 -11.024 0.000
25% -2.737 -1.641 0.207 -2.348 -1.536 -2.347 -2.031 -2.643 -1.495 -1.411 -3.922 -0.397 -0.224 -2.171 -4.415 -5.634 -2.216 -0.404 -1.050 -2.433 -5.930 -0.118 -3.099 -1.468 -1.365 -0.338 -3.652 -2.171 -2.787 -1.867 -1.818 -3.420 -2.243 -2.137 0.336 -0.944 -1.256 -2.988 -0.272 -2.940 0.000
50% -0.748 0.472 2.256 -0.135 -0.102 -1.001 -0.917 -0.389 -0.068 0.101 -1.921 1.508 1.637 -0.957 -2.383 -2.683 -0.015 0.883 1.279 0.033 -3.533 0.975 -0.262 0.969 0.025 1.951 -0.885 -0.891 -1.176 0.184 0.490 0.052 -0.066 -0.255 2.099 1.567 -0.128 -0.317 0.919 -0.921 0.000
75% 1.840 2.544 4.566 2.131 1.340 0.380 0.224 1.723 1.409 1.477 0.119 3.571 3.460 0.271 -0.359 -0.095 2.069 2.572 3.493 2.512 -1.266 2.026 2.452 3.546 1.397 4.130 2.189 0.376 0.630 2.036 2.731 3.762 2.255 1.437 4.064 3.984 1.176 2.279 2.058 1.120 0.000
max 15.493 13.089 17.091 13.236 8.134 6.976 8.006 11.679 8.138 8.108 11.826 15.081 15.420 5.671 12.246 13.583 16.756 13.180 13.238 16.052 13.840 7.410 14.459 17.163 8.223 16.836 17.560 6.528 10.722 12.506 17.255 23.633 16.692 14.358 15.291 19.330 7.467 15.290 7.760 10.654 1.000

V36 column has the maximum value.

V16 column has the lowest value.

Exploratory Data Analysis (EDA)

Plotting histograms and boxplots for all the variables

In [15]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Plotting all the features at one go

In [16]:
for feature in df.columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None) 
In [17]:
data["Target"].value_counts() ## checking the class distribution in target variable for train data
Out[17]:
0    18890
1     1110
Name: Target, dtype: int64
In [18]:
data_test["Target"].value_counts() ## checking the class distribution in target variable for test data
Out[18]:
0    4718
1     282
Name: Target, dtype: int64

Data Pre-processing

In [19]:
## Dividing train data into X and y 
X = data.drop(["Target"], axis=1)
y = data["Target"]
In [20]:
# Splitting train dataset into training and validation set

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
In [21]:
X_train.shape  # dimensions of the X_train data
Out[21]:
(15000, 40)
In [22]:
X_val.shape  # dimensions of the X_val data
Out[22]:
(5000, 40)
In [23]:
# Dividing test data into X_test and y_test

X_test = data_test.drop(["Target"], axis=1)         
y_test = data_test["Target"]           
In [24]:
X_test.shape # dimensions of the X_test data
Out[24]:
(5000, 40)

Missing value imputation

Missing values treatment

In [25]:
imputer = SimpleImputer(strategy="median")
In [26]:
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
In [27]:
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)
In [28]:
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)
In [29]:
print(X_train.isna().sum())
print("-" * 30)

print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64

Missing values have been treated.

Model Building

Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [30]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
            
        },
        index=[0],
    )

    return df_perf

Defining scorer to be used for cross-validation and hyperparameter tuning

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
In [31]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data

In [32]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215
In [33]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

Model Building with Oversampled data

In [34]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [35]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))

print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 832
Before OverSampling, counts of label '0': 14168 

After OverSampling, counts of label '1': 14168
After OverSampling, counts of label '0': 14168 

After OverSampling, the shape of train_X: (28336, 40)
After OverSampling, the shape of train_y: (28336,) 

In [36]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )  
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215
In [37]:
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

We will tune Adaboost and GBM models.

Model Building with Undersampled data

In [38]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [39]:
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 832
Before UnderSampling, counts of label '0': 14168 

After UnderSampling, counts of label '1': 832
After UnderSampling, counts of label '0': 832 

After UnderSampling, the shape of train_X: (1664, 40)
After UnderSampling, the shape of train_y: (1664,) 

In [40]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )  
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215
In [41]:
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

We will tune Random forest model.

HyperparameterTuning

Tuning AdaBoost using oversampled data

Randomsearch CV

In [42]:
%%time 

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over) 

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9715559462639259:
CPU times: user 2min 6s, sys: 2.45 s, total: 2min 8s
Wall time: 48min 42s
In [43]:
tuned_ada = AdaBoostClassifier(
    n_estimators= 200, learning_rate= 0.2, base_estimator= DecisionTreeClassifier(max_depth=3, random_state=1)
) 

tuned_ada.fit(X_train_over,y_train_over) 
Out[43]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=200)
In [59]:
ada_train_perf = model_performance_classification_sklearn(
    tuned_ada, X_train_over, y_train_over
)
ada_train_perf
Out[59]:
Accuracy Recall Precision F1
0 0.992 0.988 0.995 0.992
In [45]:
ada_val_perf = model_performance_classification_sklearn(
    tuned_ada, X_val, y_val
) 
ada_val_perf
Out[45]:
Accuracy Recall Precision F1
0 0.979 0.849 0.789 0.818

The validation recall is not similar to cross-validated recall. The tuned Adaboost model fits the training data

Ada Boost - GridSearch

In [64]:
%%time 

# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV

param_grid = {
    "n_estimators": np.arange(100, 150, 200),
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)

print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'learning_rate': 0.2, 'n_estimators': 100} 
Score: 0.949393465111882
CPU times: user 58.1 s, sys: 549 ms, total: 58.6 s
Wall time: 11min 37s
In [68]:
adb_tuned1 = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.2,
    random_state=1,
    base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)

# Fit the model on training data
adb_tuned1.fit(X_train_over, y_train_over)
Out[68]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=100, random_state=1)
In [69]:
Adaboost_grid_train = model_performance_classification_sklearn(
    adb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_grid_train
Training performance:
Out[69]:
Accuracy Recall Precision F1
0 0.949 0.926 0.972 0.948
In [70]:
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
Validation performance:
Out[70]:
Accuracy Recall Precision F1
0 0.959 0.860 0.590 0.700

The validation recall is not similar to cross-validated recall. The tuned Adaboost model fits the training data

Tuning Random forest using undersampled data

In [62]:
%%time 

# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)}


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un) ## Complete the code to fit the model on under sampled data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8990116153235697:
CPU times: user 5.75 s, sys: 198 ms, total: 5.95 s
Wall time: 1min 51s
In [65]:
# Creating new pipeline with best parameters
tuned_rf2 = RandomForestClassifier(
    max_features='sqrt',
    random_state=1,
    max_samples=0.6,
    n_estimators=250,
    min_samples_leaf=1,
)

tuned_rf2.fit(X_train_un,y_train_un)
Out[65]:
RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=250,
                       random_state=1)
In [66]:
rf2_train_perf = model_performance_classification_sklearn(
    tuned_rf2, X_train_un, y_train_un
)
rf2_train_perf
Out[66]:
Accuracy Recall Precision F1
0 0.988 0.977 0.999 0.988
In [49]:
rf2_val_perf = model_performance_classification_sklearn(
    tuned_rf2, X_val, y_val
) 
rf2_val_perf
Out[49]:
Accuracy Recall Precision F1
0 0.983 0.712 0.985 0.827

The validation recall is similar to cross-validated recall. The tuned Random forest model overfitting the training data.

Tuning Gradient boosting using oversampled data

In [50]:
%%time 

# defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid={"n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7]}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 1} with CV score=0.9723322092856124:
CPU times: user 28.1 s, sys: 1.06 s, total: 29.1 s
Wall time: 23min 16s
In [51]:
tuned_gbm = GradientBoostingClassifier(
    max_features=0.5,
    random_state=1,
    learning_rate=1,
    n_estimators=125,
    subsample=0.7,
)

tuned_gbm.fit(X_train_over, y_train_over)
Out[51]:
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           random_state=1, subsample=0.7)
In [61]:
gbm_train_perf = model_performance_classification_sklearn(
    tuned_gbm, X_train_over, y_train_over
 ) 
gbm_train_perf
Out[61]:
Accuracy Recall Precision F1
0 0.993 0.992 0.994 0.993
In [53]:
gbm_val_perf = model_performance_classification_sklearn(
    tuned_gbm, X_val, y_val
)
gbm_val_perf
Out[53]:
Accuracy Recall Precision F1
0 0.969 0.856 0.678 0.757

The validation recall is not similar to cross-validated recall. The tuned Gradient boosting model fits the training data

Model performance comparison and choosing the final model

In [75]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        gbm_train_perf.T,
        ada_train_perf.T,
        Adaboost_grid_train.T,
        rf2_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "AdaBoost classifier Random Search tuned with oversampled data",
    "AdaBoost classifier Grid Search tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[75]:
Gradient Boosting tuned with oversampled data AdaBoost classifier Random Search tuned with oversampled data AdaBoost classifier Grid Search tuned with oversampled data Random forest tuned with undersampled data
Accuracy 0.993 0.992 0.949 0.988
Recall 0.992 0.988 0.926 0.977
Precision 0.994 0.995 0.972 0.999
F1 0.993 0.992 0.948 0.988
In [77]:
# validation performance comparison

models_val_comp_df = pd.concat(
    [
        gbm_val_perf.T,
        ada_val_perf.T,
        Adaboost_grid_train.T,
        rf2_val_perf.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "AdaBoost classifier tuned with oversampled data",
    "AdaBoost classifier Grid Search tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[77]:
Gradient Boosting tuned with oversampled data AdaBoost classifier tuned with oversampled data AdaBoost classifier Grid Search tuned with oversampled data Random forest tuned with undersampled data
Accuracy 0.969 0.979 0.949 0.983
Recall 0.856 0.849 0.926 0.712
Precision 0.678 0.789 0.972 0.985
F1 0.757 0.818 0.948 0.827
  • The AdaBoost model tuned using Random search is giving the best validation recall of 0.85 but it has the low values of validation precision.
  • Let's check the model's performance on test set and then see the feature importance from the tuned Adaboost model.

Test set final performance

In [56]:
# Let's check the performance on test set
ada_grid_test = model_performance_classification_sklearn(tuned_ada, X_test, y_test)
print("Test performance:")
ada_grid_test
Test performance:
Out[56]:
Accuracy Recall Precision F1
0 0.978 0.844 0.785 0.814
  • The performance on test data is good.
  • Let's check the important features for prediction as per the the final model

Feature Importances

In [86]:
feature_names = X_train.columns
importances =  tuned_ada.feature_importances_ 
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

V30 is the most important feature followed by V9 and V18.

Pipelines to build the final model

In [87]:
Pipeline_model = Pipeline([('imputer', SimpleImputer()), ('AdaBoost', AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=200)

)])

model.fit(X_train, y_train)
Out[87]:
AdaBoostClassifier(random_state=1)
In [81]:
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]

# Since we already have a separate test set, we don't need to divide data into train and test

X_test1 = df_test.drop(["Target"], axis=1) 
y_test1 = df_test["Target"] 
In [82]:
# We can't oversample/undersample data without doing missing value treatment, so let's first treat the missing values in the train set
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)

Best model is built on the oversampled data.

In [83]:
# # Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_over1, y_over1 = sm.fit_resample(X1, Y1)
In [88]:
Pipeline_model.fit(X_train, y_train) 
Out[88]:
Pipeline(steps=[('imputer', SimpleImputer()),
                ('AdaBoost',
                 AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                                          random_state=1),
                                    learning_rate=0.2, n_estimators=200))])
In [89]:
Pipeline_model_test = model_performance_classification_sklearn(Pipeline_model, X_test, y_test)  
Pipeline_model_test
Out[89]:
Accuracy Recall Precision F1
0 0.985 0.762 0.964 0.851

Business Insights and Conclusions

Adaboost classifier for over sampled data using random search is the best classifier model.

V30, V9 and V18 are the best features.