Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

The data provided is a transformed version of original data which was collected using sensors.
Train.csv - To be used for training and tuning of models.
Test.csv - To be used only for testing the performance of the final best model.
Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)
from sklearn import metrics

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

pd.set_option("display.float_format", lambda x: "%.3f" % x)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

pd.set_option("display.float_format", lambda x: "%.3f" % x)

import warnings

warnings.filterwarnings("ignore")

Loading the dataset¶

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

df = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Train.csv.csv') 
df_test = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Test.csv.csv')

Data Overview¶

Observations
Sanity checks

df.shape ## dimensions of the train data

(20000, 41)

df_test.shape ##  dimensions of the test data

(5000, 41)

## Creating copy of training data

data = df.copy()

## Creating copy of test data

data_test = df_test.copy()

data.head() ##  top 5 rows of training data

data_test.tail() ##  last 5 rows of test data

## data types of the columns in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

There are 41 numeric (float and int) columns in the data.

data.duplicated().sum() ## duplicate entries in the data

0

There are no duplicate values in the dataset.

data.isnull().sum() ## missing entries in the train data

V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64

data_test.isnull().sum() ## missing entries in the test data

V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

There are missing values in train and test data.

data.describe(include="all") ## statitical summary of the train data

V36 column has the maximum value.

V16 column has the lowest value.

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Plotting all the features at one go¶

for feature in df.columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None)

data["Target"].value_counts() ## checking the class distribution in target variable for train data

0    18890
1     1110
Name: Target, dtype: int64

data_test["Target"].value_counts() ## checking the class distribution in target variable for test data

0    4718
1     282
Name: Target, dtype: int64

Data Pre-processing¶

## Dividing train data into X and y 
X = data.drop(["Target"], axis=1)
y = data["Target"]

# Splitting train dataset into training and validation set

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)

X_train.shape  # dimensions of the X_train data

(15000, 40)

X_val.shape  # dimensions of the X_val data

(5000, 40)

# Dividing test data into X_test and y_test

X_test = data_test.drop(["Target"], axis=1)         
y_test = data_test["Target"]

X_test.shape # dimensions of the X_test data

(5000, 40)

Missing value imputation¶

Missing values treatment

imputer = SimpleImputer(strategy="median")

X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)

X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)

print(X_train.isna().sum())
print("-" * 30)

print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())

V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64

Missing values have been treated.

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

True positives (TP) are failures correctly predicted by the model.
False negatives (FN) are real failures in a generator where there is no detection by model.
False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
            
        },
        index=[0],
    )

    return df_perf

Defining scorer to be used for cross-validation and hyperparameter tuning¶

We want to reduce false negatives and will try to maximize "Recall".
To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data¶

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215

# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

Model Building with Oversampled data¶

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))

print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))

Before OverSampling, counts of label '1': 832
Before OverSampling, counts of label '0': 14168 

After OverSampling, counts of label '1': 14168
After OverSampling, counts of label '0': 14168 

After OverSampling, the shape of train_X: (28336, 40)
After OverSampling, the shape of train_y: (28336,)

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )  
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215

fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

We will tune Adaboost and GBM models.

Model Building with Undersampled data¶

# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))


print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))

Before UnderSampling, counts of label '1': 832
Before UnderSampling, counts of label '0': 14168 

After UnderSampling, counts of label '1': 832
After UnderSampling, counts of label '0': 832 

After UnderSampling, the shape of train_X: (1664, 40)
After UnderSampling, the shape of train_y: (1664,)

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )  
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

Cross-Validation performance on training dataset:

Logistic regression: 0.4927566553639709
Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
dtree: 0.6982829521679532

Validation Performance:

Logistic regression: 0.48201438848920863
Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
dtree: 0.7050359712230215

fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.

We will tune Random forest model.

HyperparameterTuning¶

Tuning AdaBoost using oversampled data

Randomsearch CV

%%time 

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over) 

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9715559462639259:
CPU times: user 2min 6s, sys: 2.45 s, total: 2min 8s
Wall time: 48min 42s

tuned_ada = AdaBoostClassifier(
    n_estimators= 200, learning_rate= 0.2, base_estimator= DecisionTreeClassifier(max_depth=3, random_state=1)
) 

tuned_ada.fit(X_train_over,y_train_over)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=200)

ada_train_perf = model_performance_classification_sklearn(
    tuned_ada, X_train_over, y_train_over
)
ada_train_perf

ada_val_perf = model_performance_classification_sklearn(
    tuned_ada, X_val, y_val
) 
ada_val_perf

The validation recall is not similar to cross-validated recall. The tuned Adaboost model fits the training data

Ada Boost - GridSearch

%%time 

# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV

param_grid = {
    "n_estimators": np.arange(100, 150, 200),
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)

print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)

Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'learning_rate': 0.2, 'n_estimators': 100} 
Score: 0.949393465111882
CPU times: user 58.1 s, sys: 549 ms, total: 58.6 s
Wall time: 11min 37s

adb_tuned1 = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.2,
    random_state=1,
    base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)

# Fit the model on training data
adb_tuned1.fit(X_train_over, y_train_over)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=100, random_state=1)

Adaboost_grid_train = model_performance_classification_sklearn(
    adb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_grid_train

Training performance:

Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val

Validation performance:

The validation recall is not similar to cross-validated recall. The tuned Adaboost model fits the training data

Tuning Random forest using undersampled data

%%time 

# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)}


#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un) ## Complete the code to fit the model on under sampled data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8990116153235697:
CPU times: user 5.75 s, sys: 198 ms, total: 5.95 s
Wall time: 1min 51s

# Creating new pipeline with best parameters
tuned_rf2 = RandomForestClassifier(
    max_features='sqrt',
    random_state=1,
    max_samples=0.6,
    n_estimators=250,
    min_samples_leaf=1,
)

tuned_rf2.fit(X_train_un,y_train_un)

RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=250,
                       random_state=1)

rf2_train_perf = model_performance_classification_sklearn(
    tuned_rf2, X_train_un, y_train_un
)
rf2_train_perf

rf2_val_perf = model_performance_classification_sklearn(
    tuned_rf2, X_val, y_val
) 
rf2_val_perf

The validation recall is similar to cross-validated recall. The tuned Random forest model overfitting the training data.

Tuning Gradient boosting using oversampled data

%%time 

# defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid={"n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7]}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 1} with CV score=0.9723322092856124:
CPU times: user 28.1 s, sys: 1.06 s, total: 29.1 s
Wall time: 23min 16s

tuned_gbm = GradientBoostingClassifier(
    max_features=0.5,
    random_state=1,
    learning_rate=1,
    n_estimators=125,
    subsample=0.7,
)

tuned_gbm.fit(X_train_over, y_train_over)

GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           random_state=1, subsample=0.7)

gbm_train_perf = model_performance_classification_sklearn(
    tuned_gbm, X_train_over, y_train_over
 ) 
gbm_train_perf

gbm_val_perf = model_performance_classification_sklearn(
    tuned_gbm, X_val, y_val
)
gbm_val_perf

The validation recall is not similar to cross-validated recall. The tuned Gradient boosting model fits the training data

Model performance comparison and choosing the final model¶

# training performance comparison

models_train_comp_df = pd.concat(
    [
        gbm_train_perf.T,
        ada_train_perf.T,
        Adaboost_grid_train.T,
        rf2_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "AdaBoost classifier Random Search tuned with oversampled data",
    "AdaBoost classifier Grid Search tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df

Training performance comparison:

# validation performance comparison

models_val_comp_df = pd.concat(
    [
        gbm_val_perf.T,
        ada_val_perf.T,
        Adaboost_grid_train.T,
        rf2_val_perf.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Gradient Boosting tuned with oversampled data",
    "AdaBoost classifier tuned with oversampled data",
    "AdaBoost classifier Grid Search tuned with oversampled data",
    "Random forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df

Validation performance comparison:

The AdaBoost model tuned using Random search is giving the best validation recall of 0.85 but it has the low values of validation precision.
Let's check the model's performance on test set and then see the feature importance from the tuned Adaboost model.

Test set final performance¶

# Let's check the performance on test set
ada_grid_test = model_performance_classification_sklearn(tuned_ada, X_test, y_test)
print("Test performance:")
ada_grid_test

Test performance:

The performance on test data is good.
Let's check the important features for prediction as per the the final model

Feature Importances

feature_names = X_train.columns
importances =  tuned_ada.feature_importances_ 
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

V30 is the most important feature followed by V9 and V18.

Pipelines to build the final model¶

Pipeline_model = Pipeline([('imputer', SimpleImputer()), ('AdaBoost', AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=200)

)])

model.fit(X_train, y_train)

AdaBoostClassifier(random_state=1)

# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]

# Since we already have a separate test set, we don't need to divide data into train and test

X_test1 = df_test.drop(["Target"], axis=1) 
y_test1 = df_test["Target"]

# We can't oversample/undersample data without doing missing value treatment, so let's first treat the missing values in the train set
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)

Best model is built on the oversampled data.

# # Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_over1, y_over1 = sm.fit_resample(X1, Y1)

Pipeline_model.fit(X_train, y_train)

Pipeline(steps=[('imputer', SimpleImputer()),
                ('AdaBoost',
                 AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                                          random_state=1),
                                    learning_rate=0.2, n_estimators=200))])

Pipeline_model_test = model_performance_classification_sklearn(Pipeline_model, X_test, y_test)  
Pipeline_model_test

Business Insights and Conclusions¶

Adaboost classifier for over sampled data using random search is the best classifier model.

V30, V9 and V18 are the best features.

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	V38	V39	V40
0	-4.465	-4.679	3.102	0.506	-0.221	-2.033	-2.911	0.051	-1.522	3.762	-5.715	0.736	0.981	1.418	-3.376	-3.047	0.306	2.914	2.270	4.395	-2.388	0.646	-1.191	3.133	0.665	-2.511	-0.037	0.726	-3.982	-1.073	1.667	3.060	-1.690	2.846	2.235	6.667	0.444	-2.369	2.951	-3.480
1	3.366	3.653	0.910	-1.368	0.332	2.359	0.733	-4.332	0.566	-0.101	1.914	-0.951	-1.255	-2.707	0.193	-4.769	-2.205	0.908	0.757	-5.834	-3.065	1.597	-1.757	1.766	-0.267	3.625	1.500	-0.586	0.783	-0.201	0.025	-1.795	3.033	-2.468	1.895	-2.298	-1.731	5.909	-0.386	0.616
2	-3.832	-5.824	0.634	-2.419	-1.774	1.017	-2.099	-3.173	-2.082	5.393	-0.771	1.107	1.144	0.943	-3.164	-4.248	-4.039	3.689	3.311	1.059	-2.143	1.650	-1.661	1.680	-0.451	-4.551	3.739	1.134	-2.034	0.841	-1.600	-0.257	0.804	4.086	2.292	5.361	0.352	2.940	3.839	-4.309
3	1.618	1.888	7.046	-1.147	0.083	-1.530	0.207	-2.494	0.345	2.119	-3.053	0.460	2.705	-0.636	-0.454	-3.174	-3.404	-1.282	1.582	-1.952	-3.517	-1.206	-5.628	-1.818	2.124	5.295	4.748	-2.309	-3.963	-6.029	4.949	-3.584	-2.577	1.364	0.623	5.550	-1.527	0.139	3.101	-1.277
4	-0.111	3.872	-3.758	-2.983	3.793	0.545	0.205	4.849	-1.855	-6.220	1.998	4.724	0.709	-1.989	-2.633	4.184	2.245	3.734	-6.313	-5.380	-0.887	2.062	9.446	4.490	-3.945	4.582	-8.780	-3.383	5.107	6.788	2.044	8.266	6.629	-10.069	1.223	-3.230	1.687	-2.164	-3.645	6.510

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	V38	V39	V40
4995	-5.120	1.635	1.251	4.036	3.291	-2.932	-1.329	1.754	-2.985	1.249	-6.878	3.715	-2.512	-1.395	-2.554	-2.197	4.772	2.403	3.792	0.487	-2.028	1.778	3.668	11.375	-1.977	2.252	-7.319	1.907	-3.734	-0.012	2.120	9.979	0.063	0.217	3.036	2.109	-0.557	1.939	0.513	-2.694
4996	-5.172	1.172	1.579	1.220	2.530	-0.669	-2.618	-2.001	0.634	-0.579	-3.671	0.460	3.321	-1.075	-7.113	-4.356	-0.001	3.698	-0.846	-0.222	-3.645	0.736	0.926	3.278	-2.277	4.458	-4.543	-1.348	-1.779	0.352	-0.214	4.424	2.604	-2.152	0.917	2.157	0.467	0.470	2.197	-2.377
4997	-1.114	-0.404	-1.765	-5.879	3.572	3.711	-2.483	-0.308	-0.922	-2.999	-0.112	-1.977	-1.623	-0.945	-2.735	-0.813	0.610	8.149	-9.199	-3.872	-0.296	1.468	2.884	2.792	-1.136	1.198	-4.342	-2.869	4.124	4.197	3.471	3.792	7.482	-10.061	-0.387	1.849	1.818	-1.246	-1.261	7.475
4998	-1.703	0.615	6.221	-0.104	0.956	-3.279	-1.634	-0.104	1.388	-1.066	-7.970	2.262	3.134	-0.486	-3.498	-4.562	3.136	2.536	-0.792	4.398	-4.073	-0.038	-2.371	-1.542	2.908	3.215	-0.169	-1.541	-4.724	-5.525	1.668	-4.100	-5.949	0.550	-1.574	6.824	2.139	-4.036	3.436	0.579
4999	-0.604	0.960	-0.721	8.230	-1.816	-2.276	-2.575	-1.041	4.130	-2.731	-3.292	-1.674	0.465	-1.646	-5.263	-7.988	6.480	0.226	4.963	6.752	-6.306	3.271	1.897	3.271	-0.637	-0.925	-6.759	2.990	-0.814	3.499	-8.435	2.370	-1.062	0.791	4.952	-7.441	-0.070	-0.918	-2.291	-5.363

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	V29	V30	V31	V32	V33	V34	V35	V36	V37	V38	V39	V40	Target
count	19982.000	19982.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000	20000.000
mean	-0.272	0.440	2.485	-0.083	-0.054	-0.995	-0.879	-0.548	-0.017	-0.013	-1.895	1.605	1.580	-0.951	-2.415	-2.925	-0.134	1.189	1.182	0.024	-3.611	0.952	-0.366	1.134	-0.002	1.874	-0.612	-0.883	-0.986	-0.016	0.487	0.304	0.050	-0.463	2.230	1.515	0.011	-0.344	0.891	-0.876	0.056
std	3.442	3.151	3.389	3.432	2.105	2.041	1.762	3.296	2.161	2.193	3.124	2.930	2.875	1.790	3.355	4.222	3.345	2.592	3.397	3.669	3.568	1.652	4.032	3.912	2.017	3.435	4.369	1.918	2.684	3.005	3.461	5.500	3.575	3.184	2.937	3.801	1.788	3.948	1.753	3.012	0.229
min	-11.876	-12.320	-10.708	-15.082	-8.603	-10.227	-7.950	-15.658	-8.596	-9.854	-14.832	-12.948	-13.228	-7.739	-16.417	-20.374	-14.091	-11.644	-13.492	-13.923	-17.956	-10.122	-14.866	-16.387	-8.228	-11.834	-14.905	-9.269	-12.579	-14.796	-13.723	-19.877	-16.898	-17.985	-15.350	-14.833	-5.478	-17.375	-6.439	-11.024	0.000
25%	-2.737	-1.641	0.207	-2.348	-1.536	-2.347	-2.031	-2.643	-1.495	-1.411	-3.922	-0.397	-0.224	-2.171	-4.415	-5.634	-2.216	-0.404	-1.050	-2.433	-5.930	-0.118	-3.099	-1.468	-1.365	-0.338	-3.652	-2.171	-2.787	-1.867	-1.818	-3.420	-2.243	-2.137	0.336	-0.944	-1.256	-2.988	-0.272	-2.940	0.000
50%	-0.748	0.472	2.256	-0.135	-0.102	-1.001	-0.917	-0.389	-0.068	0.101	-1.921	1.508	1.637	-0.957	-2.383	-2.683	-0.015	0.883	1.279	0.033	-3.533	0.975	-0.262	0.969	0.025	1.951	-0.885	-0.891	-1.176	0.184	0.490	0.052	-0.066	-0.255	2.099	1.567	-0.128	-0.317	0.919	-0.921	0.000
75%	1.840	2.544	4.566	2.131	1.340	0.380	0.224	1.723	1.409	1.477	0.119	3.571	3.460	0.271	-0.359	-0.095	2.069	2.572	3.493	2.512	-1.266	2.026	2.452	3.546	1.397	4.130	2.189	0.376	0.630	2.036	2.731	3.762	2.255	1.437	4.064	3.984	1.176	2.279	2.058	1.120	0.000
max	15.493	13.089	17.091	13.236	8.134	6.976	8.006	11.679	8.138	8.108	11.826	15.081	15.420	5.671	12.246	13.583	16.756	13.180	13.238	16.052	13.840	7.410	14.459	17.163	8.223	16.836	17.560	6.528	10.722	12.506	17.255	23.633	16.692	14.358	15.291	19.330	7.467	15.290	7.760	10.654	1.000

	Gradient Boosting tuned with oversampled data	AdaBoost classifier Random Search tuned with oversampled data	AdaBoost classifier Grid Search tuned with oversampled data	Random forest tuned with undersampled data
Accuracy	0.993	0.992	0.949	0.988
Recall	0.992	0.988	0.926	0.977
Precision	0.994	0.995	0.972	0.999
F1	0.993	0.992	0.948	0.988

	Gradient Boosting tuned with oversampled data	AdaBoost classifier tuned with oversampled data	AdaBoost classifier Grid Search tuned with oversampled data	Random forest tuned with undersampled data
Accuracy	0.969	0.979	0.949	0.983
Recall	0.856	0.849	0.926	0.712
Precision	0.678	0.789	0.972	0.985
F1	0.757	0.818	0.948	0.827