Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among the renewable alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and it has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators can be repaired before failing/breaking, thereby reducing the overall maintenance cost. The nature of predictions made by the classification model will translate as follows: true positives are failures correctly predicted (the generator is repaired in time), false negatives are failures the model misses (the generator breaks and must be replaced), and false positives are predicted failures that do not occur (an unnecessary inspection).
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.float_format", lambda x: "%.3f" % x)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
pd.set_option("display.float_format", lambda x: "%.3f" % x)
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Train.csv.csv')
df_test = pd.read_csv('/content/drive/MyDrive/DSBA/Model Tuning/ReneWind Project/Test.csv.csv')
df.shape ## dimensions of the train data
df_test.shape ## dimensions of the test data
## Creating copy of training data
data = df.copy()
## Creating copy of test data
data_test = df_test.copy()
data.head() ## top 5 rows of training data
data_test.tail() ## last 5 rows of test data
## data types of the columns in the dataset
data.info()
There are 41 numeric (float and int) columns in the data.
data.duplicated().sum() ## duplicate entries in the data
There are no duplicate values in the dataset.
data.isnull().sum() ## missing entries in the train data
data_test.isnull().sum() ## missing entries in the test data
There are missing values in train and test data.
data.describe(include="all") ## statistical summary of the train data
Among the predictors, V36 has the largest maximum value, while V16 has the smallest minimum value.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
for feature in df.columns:
histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None)
data["Target"].value_counts() ## checking the class distribution in target variable for train data
data_test["Target"].value_counts() ## checking the class distribution in target variable for test data
## Dividing train data into X and y
X = data.drop(["Target"], axis=1)
y = data["Target"]
# Splitting train dataset into training and validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
X_train.shape # dimensions of the X_train data
X_val.shape # dimensions of the X_val data
# Dividing test data into X_test and y_test
X_test = data_test.drop(["Target"], axis=1)
y_test = data_test["Target"]
X_test.shape # dimensions of the X_test data
Missing values treatment
imputer = SimpleImputer(strategy="median")
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Missing values have been treated.
Which metric to optimize?
The nature of predictions made by the classification model will translate as follows: a false negative (a real failure the model misses) means the generator breaks and has to be replaced, which is the most expensive outcome; a false positive (a predicted failure that does not occur) only triggers an inspection, which is the cheapest outcome; a true positive means the generator is repaired before it breaks. Since replacement costs far more than inspection, we want to minimize false negatives, i.e., maximize Recall.
Let's define a function to output different metrics (including recall) on the train and validation sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
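As a rough illustration only (the actual repair, replacement, and inspection costs are not provided, so the numbers below are hypothetical), a small helper shows why missed failures dominate the maintenance bill and why recall is the metric to maximize:
# Illustration with hypothetical costs (not given in the problem statement):
# repair < replace and inspection < repair, as stated above
def maintenance_cost(y_true, y_pred, repair=100, replace=400, inspect=25):
    """Total maintenance cost implied by a set of predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # TP -> generator repaired before it breaks, FN -> generator breaks and is replaced,
    # FP -> unnecessary inspection, TN -> no cost
    return tp * repair + fn * replace + fp * inspect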
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
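Below is a minimal sketch of the confusion-matrix helper referred to above; the function name and the count/percentage formatting are our own choices.
def confusion_matrix_sklearn(model, predictors, target):
    """
    Plot the confusion matrix with counts and percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

# Example usage after fitting any model: confusion_matrix_sklearn(model, X_val, y_val)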
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.
We will tune the Adaboost and GBM models.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
We can see that Adaboost is giving the highest cross-validated recall followed by Random Forest and GBM.
We will tune the Random forest model.
Tuning AdaBoost using oversampled data
RandomizedSearchCV
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Building AdaBoost with the best parameters obtained from random search
tuned_ada = AdaBoostClassifier(
    random_state=1,
    n_estimators=200,
    learning_rate=0.2,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_ada.fit(X_train_over,y_train_over)
ada_train_perf = model_performance_classification_sklearn(
tuned_ada, X_train_over, y_train_over
)
ada_train_perf
ada_val_perf = model_performance_classification_sklearn(
tuned_ada, X_val, y_val
)
ada_val_perf
The validation recall is noticeably lower than the cross-validated recall, and the model fits the (oversampled) training data almost perfectly, so the tuned AdaBoost model is overfitting the training data.
AdaBoost - GridSearchCV
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(100, 150, 200),
"learning_rate": [0.2, 0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
adb_tuned1 = AdaBoostClassifier(
n_estimators=100,
learning_rate=0.2,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned1.fit(X_train_over, y_train_over)
Adaboost_grid_train = model_performance_classification_sklearn(
adb_tuned1, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_grid_train
Adaboost_grid_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_grid_val
The validation recall is noticeably lower than the cross-validated recall, and the model fits the (oversampled) training data almost perfectly, so the tuned AdaBoost model is overfitting the training data.
Tuning Random forest using undersampled data
%%time
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)  # fitting RandomizedSearchCV on the undersampled data
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Building Random forest with the best parameters obtained from random search
tuned_rf2 = RandomForestClassifier(
max_features='sqrt',
random_state=1,
max_samples=0.6,
n_estimators=250,
min_samples_leaf=1,
)
tuned_rf2.fit(X_train_un,y_train_un)
rf2_train_perf = model_performance_classification_sklearn(
tuned_rf2, X_train_un, y_train_un
)
rf2_train_perf
rf2_val_perf = model_performance_classification_sklearn(
tuned_rf2, X_val, y_val
)
rf2_val_perf
The validation recall is similar to the cross-validated recall, but the tuned Random forest model is still overfitting the training data.
Tuning Gradient boosting using oversampled data
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, scoring=scorer, n_iter=50, n_jobs = -1, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
tuned_gbm = GradientBoostingClassifier(
max_features=0.5,
random_state=1,
learning_rate=1,
n_estimators=125,
subsample=0.7,
)
tuned_gbm.fit(X_train_over, y_train_over)
gbm_train_perf = model_performance_classification_sklearn(
tuned_gbm, X_train_over, y_train_over
)
gbm_train_perf
gbm_val_perf = model_performance_classification_sklearn(
tuned_gbm, X_val, y_val
)
gbm_val_perf
The validation recall is noticeably lower than the cross-validated recall, and the model fits the (oversampled) training data almost perfectly, so the tuned Gradient boosting model is overfitting the training data.
# training performance comparison
models_train_comp_df = pd.concat(
[
gbm_train_perf.T,
ada_train_perf.T,
Adaboost_grid_train.T,
rf2_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Gradient Boosting tuned with oversampled data",
"AdaBoost classifier Random Search tuned with oversampled data",
"AdaBoost classifier Grid Search tuned with oversampled data",
"Random forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
# validation performance comparison
models_val_comp_df = pd.concat(
[
gbm_val_perf.T,
ada_val_perf.T,
Adaboost_grid_val.T,
rf2_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Gradient Boosting tuned with oversampled data",
"AdaBoost classifier tuned with oversampled data",
"AdaBoost classifier Grid Search tuned with oversampled data",
"Random forest tuned with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
# Let's check the performance of the final (random-search tuned AdaBoost) model on the test set
ada_random_test = model_performance_classification_sklearn(tuned_ada, X_test, y_test)
print("Test performance:")
ada_random_test
Feature Importances
feature_names = X_train.columns
importances = tuned_ada.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
V30 is the most important feature followed by V9 and V18.
# Creating a pipeline with the best model: median imputation followed by the random-search tuned AdaBoost
Pipeline_model = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        (
            "AdaBoost",
            AdaBoostClassifier(
                base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
                n_estimators=200,
                learning_rate=0.2,
                random_state=1,
            ),
        ),
    ]
)
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]
# Since we already have a separate test set, we don't need to divide data into train and test
X_test1 = df_test.drop(["Target"], axis=1)
y_test1 = df_test["Target"]
# We can't oversample/undersample data without doing missing value treatment, so let's first treat the missing values in the train set
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)
The best model was tuned on oversampled data, so we oversample the full training set before fitting the final pipeline.
# Synthetic Minority Over Sampling Technique (SMOTE)
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_over1, y_over1 = sm.fit_resample(X1, Y1)
# Fitting the pipeline on the full (imputed and oversampled) training data
Pipeline_model.fit(X_over1, y_over1)
# Checking pipeline performance on the original (un-imputed) test set; the pipeline's imputer handles missing values
Pipeline_model_test = model_performance_classification_sklearn(Pipeline_model, X_test1, y_test1)
Pipeline_model_test
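If the final pipeline were to be handed over for deployment, it could be persisted with joblib; a minimal sketch (the file name below is our own choice):
import joblib

# Save the fitted pipeline so the imputation and AdaBoost steps travel together
joblib.dump(Pipeline_model, "renewind_adaboost_pipeline.joblib")

# Later, in production: reload the pipeline and score new sensor readings
loaded_pipeline = joblib.load("renewind_adaboost_pipeline.joblib")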
The AdaBoost classifier tuned with random search on the oversampled data is the best classification model.
V30, V9, and V18 are the most important features.