A significant number of hotel bookings are called off because of cancellations or no-shows, typically due to a change of plans or scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which is convenient for guests but undesirable and potentially revenue-diminishing for hotels. Losses are particularly high for last-minute cancellations.
New technologies and online booking channels have dramatically changed customers' booking options and behavior. This adds a further dimension to the challenge of handling cancellations, which are no longer driven only by traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts, from lost revenue to the added cost and effort of reselling rooms at short notice.
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Importing libraries
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
from sklearn.model_selection import train_test_split
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Importing data set from google drive
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/DSBA/Logistic Regression/INN Hotels Project/INNHotelsGroup.csv')
# First 5 rows
data.head()
# Last 5 rows
data.tail()
# Shape of data
data.shape
# Data types of the columns in dataset
data.info()
There are 14 numeric (float and int type) and 5 string (object type) columns in the data.
# Checking duplicate values
data.duplicated().sum()
There are no duplicate values in the dataset.
# Dropping Booking_ID column as it is just a unique identifier with no predictive value
data = data.drop(["Booking_ID"], axis=1)
data.head()
# Statistical summary of data
data.describe(include="all")
The most popular meal plan is Meal Plan 1.
Room_Type 1 is the most reserved room type.
The average price per room is ~103 euros.
Bookings include more week nights than weekend nights.
The average lead time from booking to stay is ~85 days.
The maximum number of previous cancellations by a guest is 13.
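As a quick check, the figures quoted above can be read straight off the summary table; a minimal slice (using the column names already present in this dataset) is shown below.
# Pulling the mean and maximum of the columns referenced in the observations above
data[["avg_price_per_room", "lead_time", "no_of_previous_cancellations"]].describe().loc[["mean", "max"]]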
Univariate Analysis
# Functions for histogram
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
)
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
)
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
)
# Plotting Lead_time
histogram_boxplot(data, "lead_time")
There seem to be a lot of outliers in the data for lead_time.
# Plotting Average price per room
histogram_boxplot(data, "avg_price_per_room")
There seem to be a lot of outliers in the data for average price per room.
# Checking bookings with an average room price of 0
data[data["avg_price_per_room"] == 0]
# Checking which market segments the zero-price bookings belong to
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
# Calculating the 25th percentile (Q1)
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th percentile (Q3)
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
# Capping rooms priced at 500 or more to the upper whisker value
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
# Plotting previous booking cancellations
histogram_boxplot(data, "no_of_previous_cancellations")
# Plotting previous booking not cancellations
histogram_boxplot(data, "no_of_previous_bookings_not_canceled")
# Functions for labeled barplot
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature])
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
)
else:
label = p.get_height()
x = p.get_x() + p.get_width() / 2
y = p.get_height()
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
)
plt.show()
# Plotting number of adults
labeled_barplot(data, "no_of_adults", perc=True)
#Plotting number of children
labeled_barplot(data, "no_of_children", perc=True)
# Replacing the rare values of 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
# Plotting number of week nights
labeled_barplot(data, "no_of_week_nights", perc=True)
# Plotting number of weekend nights
labeled_barplot(data, "no_of_weekend_nights", perc=True)
# Plotting Required car parking space
labeled_barplot(data, "required_car_parking_space", perc=True)
# Plotting Type of meal plan
labeled_barplot(data, "type_of_meal_plan", perc=True)
# Plotting Room Type Reserved
labeled_barplot(data, "room_type_reserved", perc=True)
# Plotting Arrival month
labeled_barplot(data, "arrival_month", perc=True)
October seems to be the busiest month, with the most arrivals.
# Plotting Market Segment Type
labeled_barplot(data, "market_segment_type", perc=True)
Online seems to be the most common market segment through which guests book the hotel.
# Plotting Number of special requests
labeled_barplot(data, "no_of_special_requests", perc=True)
# Plotting Booking status
labeled_barplot(data, "booking_status", perc=True)
# Encoding Canceled bookings as 1 and Not_Canceled as 0
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
Bivariate Analysis
# Correlation heatmap of the numeric columns
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
## Function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
    axs[0, 1].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
## Functions for Stacked barplot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
# Plotting Prices of rooms across various market segments
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
The average price per room seems to be higher for the Online market segment.
# Plotting Booking status across various market segments
stacked_barplot(data, "market_segment_type", "booking_status")
# Plotting Booking status and Number of special requests
stacked_barplot(data, "no_of_special_requests", "booking_status")
Bookings with more special requests seem less likely to be canceled.
# Plotting Number of special requests and Average price per room
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="no_of_special_requests", y="avg_price_per_room", palette="gist_rainbow")
plt.show()
Average price per room seems to increase with number of special requests from guests.
# Distribution between Average price per room and booking status
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
# Distribution between Lead time and booking status
distribution_plot_wrt_target(data, "lead_time", "booking_status")
# Combining adults and children into a family-size feature
family_data = data[(data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)].copy()  # copy to avoid SettingWithCopyWarning
family_data.shape
family_data["no_of_family_members"] = (
family_data["no_of_adults"] + family_data["no_of_children"]
)
# Plotting Number of families and booking status
stacked_barplot(family_data, "no_of_family_members", "booking_status")
# Combining week nights and weekend nights into total stay length
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)].copy()  # copy to avoid SettingWithCopyWarning
stay_data.shape
stay_data["total_days"] = (
stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)
stacked_barplot(stay_data, "total_days", "booking_status")
# Plotting Repeated guests and booking status
stacked_barplot(data, "repeated_guest", "booking_status")
Repeated guests rarely seem to cancel their bookings.
# Busiest months at hotel with grouping
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()
# Percentage of bookings canceled each month
stacked_barplot(data, "arrival_month", "booking_status")
Booking cancellations seem to be highest in July and lowest in January.
# Plotting average price per room by arrival month
plt.figure(figsize=(10, 5))
sns.lineplot(data=data, x="arrival_month", y="avg_price_per_room")
plt.show()
Average price per room seems to increase during busier months.
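As a cross-check of this observation, the monthly average price can also be tabulated directly; a small sketch using the same columns:
# Average room price by arrival month (tabular view of the line plot above)
data.groupby("arrival_month")["avg_price_per_room"].mean().round(2)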
Outlier Check
# checking for outliers using boxplot by dropping booking status
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Data Preparation for modeling
# Encoding categorical values and splitting data into test & train
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True)
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
# Model evaluation criteria
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
## function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Checking for Multicollinearity
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(X_train)
# Re-creating X with a constant term for the statsmodels logistic regression
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = sm.add_constant(X)
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
disp=False
)
print(lg.summary())
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Dropping predictors with high p-values (backward elimination)
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = X_train[cols]
# fitting the model
model = sm.Logit(y_train, x_train_aux).fit(disp=False)
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(
disp=False
)
print(lg1.summary())
All p-values are now less than 0.05, so the remaining predictors are statistically significant.
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
# Converting coefficients to odds
odds = np.exp(lg1.params)
perc_change_odds = (np.exp(lg1.params) - 1) * 100
pd.set_option("display.max_columns", None)
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
Each additional adult multiplies the odds of cancellation by ~1.1 (an increase of about 12.8%).
Each additional child multiplies the odds of cancellation by ~1.1 (an increase of about 16.1%).
Each additional weekend night multiplies the odds of cancellation by ~1.1 (an increase of about 11.1%).
Each additional week night multiplies the odds of cancellation by ~1.0 (an increase of about 4.0%).
Requiring a car parking space is associated with much lower odds of cancellation (roughly an 80% decrease).
Each additional day of lead time multiplies the odds of cancellation by ~1.0 (an increase of about 1.5%).
A later arrival year multiplies the odds of cancellation by ~1.5 (an increase of about 58.9%).
A later arrival month multiplies the odds of cancellation by ~0.9 (a decrease of about 4.5%).
Each additional euro in the average room price multiplies the odds of cancellation by ~1.0 (an increase of about 1.9%).
Each additional special request multiplies the odds of cancellation by ~0.2 (a decrease of about 77.2%).
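To make the odds interpretation concrete, here is a small worked example with a hypothetical coefficient (the value 0.015 is illustrative only, not taken from the fitted model):
# Worked example: converting a logistic regression coefficient into an odds ratio
# and a percentage change in odds (the coefficient value below is hypothetical)
beta_lead_time = 0.015
odds_ratio = np.exp(beta_lead_time)  # ~1.015, the multiplicative change in odds per extra day of lead time
pct_change = (odds_ratio - 1) * 100  # ~1.5% increase in the odds of cancellation per extra day
print(round(odds_ratio, 3), round(pct_change, 2))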
Model performance on Training set
# Creating confusion matrix (default threshold of 0.5)
confusion_matrix_statsmodels(lg1, X_train1, y_train)
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(lg1, X_train1, y_train)
log_reg_model_train_perf
The F1 score on the training set is about 0.68, so further work, such as threshold tuning, is needed.
ROC-AUC on Training set
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per the ROC curve (the point that maximizes TPR - FPR)
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
# Creating confusion matrix at the ROC-optimal threshold
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
# Precision-Recall curve
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# Setting the threshold based on the precision-recall curve
optimal_threshold_curve = 0.42
# Creating confusion matrix at this threshold
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Recall improves at this threshold, and the F1 score rises to about 0.70.
Model performance on Test set
confusion_matrix_statsmodels(lg1, X_test1, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1, X_test1, y_test)
print("Test performance:")
log_reg_model_test_perf
The F1 score on the test set is slightly lower than on the training set.
# ROC curve on the test set
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Using the model with the ROC-optimal threshold (~0.37)
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
# Using the model with the precision-recall threshold (0.42)
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
The F1 score increases a little on the test data at the 0.42 threshold.
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
The logistic regression with the 0.37 threshold seems to perform best on the training set.
# test performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test performance comparison:")
models_test_comp_df
The logistic regression with the 0.37 threshold seems to perform best on the test set.
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True)  # creating dummy variables for the categorical columns
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=1)
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Building a decision tree classifier with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
# model performance on training data
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
# model performance on test data
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The default (unpruned) decision tree scores almost perfectly on the training data, which points to overfitting rather than genuinely good predictive performance.
Yes, pruning is required.
# Pre Pruning
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Using F1 as the scoring metric for the grid search
scorer = make_scorer(f1_score)
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
estimator = grid_obj.best_estimator_
estimator.fit(X_train, y_train)
# Performance of the pre-pruned tree on the training set
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train
# Performance of the pre-pruned tree on the test set
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test
Decision Tree Visual
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
# Features in tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Cost Complexity Pruning
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
)
    clf.fit(X_train, y_train)  # fitting a decision tree on the training data for each effective alpha
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
F1 score vs Alpha for Training and Test Sets
f1_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = f1_score(y_train, pred_train)
f1_train.append(values_train)
f1_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = f1_score(y_test, pred_test)
f1_test.append(values_test)
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
# Checking performance on Training set
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
# Checking performance on Test set
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Comparing Decision Tree models
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
The post-pruned tree has a higher F1 score, but the gap between its precision and recall is large.
The pre-pruned tree has a more balanced precision and recall.
The hotel should therefore use the pre-pruned model.
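As a minimal sketch of how the chosen pre-pruned model could be used operationally (assuming the tuned `estimator` from the pre-pruning step above; the file name is illustrative):
# Persisting the pre-pruned tree and scoring bookings with it (sketch)
import joblib
joblib.dump(estimator, "inn_hotels_prepruned_tree.joblib")  # save the chosen model
cancel_probability = estimator.predict_proba(X_test)[:, 1]  # predicted probability of cancellation per booking
X_test.assign(cancel_probability=cancel_probability).head()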
Insights -
Lead time and average price per room are positively correlated with canceled bookings.
The number of special requests is negatively correlated with canceled bookings.
The decision tree appears to be a better model than logistic regression for predicting cancellations.
Recommendations -
The hotel needs to pay attention to lead time, average price per room, and the number of special requests to protect brand equity.
Customers with long lead times can be sent reminders and asked about any special requests before their stay.
A list of special amenities can be shared after booking to reduce the chance of cancellation.
Depending on the lead time, the average price per room can be adjusted to attract more customers.
Depending on the number of special requests, the hotel can adjust the average price per room to manage resources and maintain brand equity.
The hotel can set cancellation and refund policies based on lead time.