The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea of the demand for different restaurants, which will help them enhance their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company improve the business.
The dataset contains the details of each food order. The detailed data dictionary is given below.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()
The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.
# check the shape of the dataset
df.shape
# use info() to print a concise summary of the DataFrame
df.info()
There are a total of 1898 non-null observations in each of the columns.
The dataset contains 9 columns: 4 are of integer type ('order_id', 'customer_id', 'food_preparation_time', 'delivery_time'), 1 is of floating point type ('cost_of_the_order') and 4 are of the general object type ('restaurant_name', 'cuisine_type', 'day_of_the_week', 'rating').
Total memory usage is approximately 133.6 KB.
# Checking for missing values
df.isnull().sum()
# get the summary statistics of the numerical data
df.describe()
Order ID and Customer ID are just identifiers for each order.
The cost of an order ranges from 4.47 to 35.41 dollars, with an average order costing around 16 dollars and a standard deviation of 7.5 dollars. The cost of 75% of the orders is below 23 dollars. This suggests that most customers prefer lower-cost food over the more expensive options.
Food preparation time ranges from 20 to 35 minutes, with an average of around 27 minutes and a standard deviation of 4.6 minutes. The spread is not very high for the food preparation time.
Delivery time ranges from 15 to 33 minutes, with an average of around 24 minutes and a standard deviation of 5 minutes. The spread is not too high for delivery time either.
df['rating'].value_counts()
# check unique order ID
df['order_id'].nunique()
# check unique customer ID
df['customer_id'].nunique()
# check unique restaurant name
df['restaurant_name'].nunique()
There are 178 unique restaurants in the dataset.
Let's check the number of orders that get served by the restaurants.
df['restaurant_name'].value_counts()
# check unique cuisine type
df['cuisine_type'].nunique()
plt.figure(figsize = (15,5))
sns.countplot(data = df, x = 'cuisine_type');
There are 14 unique cuisines in the dataset.
The distribution of cuisine types shows that orders are not spread equally across cuisines.
The most frequent cuisine type is American followed by Japanese and Italian.
Vietnamese appears to be the least popular of all the cuisines.
sns.histplot(data=df,x='cost_of_the_order')
plt.show()
sns.boxplot(data=df,x='cost_of_the_order')
plt.show()
The average cost of the order is greater than the median cost indicating that the distribution for the cost of the order is right-skewed.
The mode of the distribution indicates that a large chunk of people prefer to order food that costs around 10-12 dollars.
There are a few orders that cost more than 30 dollars. These orders might be for some expensive meals.
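The mean-versus-median comparison above can also be checked numerically. The following is a minimal sketch on a small made-up sample: the values in `toy_costs` are hypothetical and are not taken from `foodhub_order.csv`. On the real data the same calls would be applied to `df['cost_of_the_order']`.

```python
import pandas as pd

# Hypothetical order costs, invented for illustration only
toy_costs = pd.Series([8.0, 9.5, 11.0, 12.0, 12.5, 14.0, 16.0, 24.0, 31.0])

mean_cost = toy_costs.mean()
median_cost = toy_costs.median()
# Sample skewness: a positive value suggests a longer right tail
skewness = toy_costs.skew()

print(f"mean={mean_cost:.2f}, median={median_cost:.2f}, skew={skewness:.2f}")
```

A mean above the median together with a positive skewness value is consistent with a right-skewed distribution, although the histogram should still be inspected.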
# check the unique values
df['day_of_the_week'].value_counts()
sns.countplot(data = df, x = 'day_of_the_week')
# check the unique values
df['rating'].value_counts()
sns.countplot(data = df, x = 'rating')
The distribution of 'rating' shows that the most frequent rating category is 'Not given', followed by a rating of 5.
Only around 200 orders have been rated 3.
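Because 'Not given' is stored as a string, it behaves like a disguised missing value. The sketch below shows one possible way to quantify the share of unrated orders, using a made-up `toy_ratings` Series (hypothetical values, not from the dataset). On the real data the same calls would be applied to `df['rating']`.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only
toy_ratings = pd.Series(['Not given', '5', '4', 'Not given', '3', '5', 'Not given', '4'])

# Share of each rating category as a percentage of all orders
rating_share = toy_ratings.value_counts(normalize=True) * 100

# Treat 'Not given' as missing: non-numeric strings become NaN
numeric_ratings = pd.to_numeric(toy_ratings, errors='coerce')
unrated_pct = numeric_ratings.isna().mean() * 100

print(rating_share.round(1))
print(f"Unrated orders: {unrated_pct:.1f}%")
```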
sns.histplot(data=df,x='food_preparation_time')
plt.show()
sns.boxplot(data=df,x='food_preparation_time')
plt.show()
The average food preparation time is almost equal to the median food preparation time indicating that the distribution is nearly symmetrical.
The food preparation time is pretty evenly distributed between 20 and 35 minutes.
There are no outliers in this column.
sns.histplot(data=df,x='delivery_time')
plt.show()
sns.boxplot(data=df,x='delivery_time')
plt.show()
The average delivery time is slightly lower than the median delivery time, indicating that the distribution is slightly left-skewed.
Comparatively more orders have a delivery time between 24 and 30 minutes.
There are no outliers in this column.
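The "no outliers" reading from the boxplots can be cross-checked with the 1.5 × IQR whisker rule that boxplots use by default. This is a minimal sketch: the helper name `count_iqr_outliers` and both toy Series are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical delivery times in minutes, invented for illustration only
toy_times = pd.Series([15, 18, 20, 22, 24, 25, 26, 28, 30, 33])
toy_with_outlier = pd.Series([15, 18, 20, 22, 24, 25, 26, 28, 30, 90])

n_clean = count_iqr_outliers(toy_times)
n_flagged = count_iqr_outliers(toy_with_outlier)
print(n_clean, n_flagged)
```

On the real data, the same helper could be called on `df['delivery_time']` and `df['food_preparation_time']`.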
# Get top 5 restaurants with highest number of orders
df['restaurant_name'].value_counts()[:5]
The top 5 restaurants by number of orders received are 'Shake Shack', 'The Meatball Shop', 'Blue Ribbon Sushi', 'Blue Ribbon Fried Chicken' and 'Parm'.
Almost 33% of the orders in the dataset are from these restaurants.
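The share of orders coming from the top restaurants can be computed directly from the value counts. The sketch below uses a made-up `toy_orders` frame with placeholder restaurant names and a hypothetical `top_n` of 2. On the real data, `df['restaurant_name']` and `top_n = 5` would be used instead.

```python
import pandas as pd

# Hypothetical orders with placeholder restaurant names
toy_orders = pd.DataFrame(
    {'restaurant_name': ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F', 'G']}
)

# value_counts() sorts restaurants by number of orders, highest first
order_counts = toy_orders['restaurant_name'].value_counts()

top_n = 2  # hypothetical; the analysis above looks at the top 5
top_share_pct = order_counts.head(top_n).sum() / len(toy_orders) * 100
print(f"Top {top_n} restaurants account for {top_share_pct:.1f}% of orders")
```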
# Get most popular cuisine on weekends
df_weekend = df[df['day_of_the_week'] == 'Weekend']
df_weekend['cuisine_type'].value_counts()
# Get orders that cost above 20 dollars
df_greater_than_20 = df[df['cost_of_the_order'] > 20]
# Calculate the number of total orders where the cost is above 20 dollars
print('The number of total orders that cost above 20 dollars is:', df_greater_than_20.shape[0])
# Calculate percentage of such orders in the dataset
percentage = (df_greater_than_20.shape[0] / df.shape[0]) * 100
print("Percentage of orders above 20 dollars:", round(percentage, 2), '%')
There are a total of 555 orders that cost above 20 dollars.
The percentage of such orders in the dataset is around 29.24%.
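As an alternative to filtering into a separate DataFrame, the count and the percentage can both be read from a boolean mask, since the mean of a boolean Series is the fraction of `True` values. This is only a sketch on made-up costs (`toy_costs` is hypothetical).

```python
import pandas as pd

# Hypothetical order costs in dollars, invented for illustration only
toy_costs = pd.Series([6.5, 12.0, 19.99, 20.0, 25.5, 31.0, 8.75, 22.1])

above_20 = toy_costs > 20            # boolean mask; exactly 20.0 is excluded
n_above = int(above_20.sum())        # number of orders above 20 dollars
pct_above = above_20.mean() * 100    # fraction of True values as a percentage

print(n_above, round(pct_above, 2))
```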
# get the mean delivery time
print('The mean delivery time for this dataset is', round(df['delivery_time'].mean(), 2), 'minutes')
# Get the counts of each customer_id
df['customer_id'].value_counts().head()
# Relationship between cost of the order and cuisine type
plt.figure(figsize=(15,7))
sns.boxplot(x = "cuisine_type", y = "cost_of_the_order", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
# Relationship between food preparation time and cuisine type
plt.figure(figsize=(15,7))
sns.boxplot(x = "cuisine_type", y = "food_preparation_time", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
# Relationship between day of the week and delivery time
plt.figure(figsize=(15,7))
sns.boxplot(x = "day_of_the_week", y = "delivery_time", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
# Total order value per restaurant (top 14); no figure is needed because nothing is plotted here
df.groupby(['restaurant_name'])['cost_of_the_order'].sum().sort_values(ascending = False).head(14)
# Relationship between rating and delivery time
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'rating', y = 'delivery_time', data = df)
plt.show()
# Relationship between rating and food preparation time
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'rating', y = 'food_preparation_time', data = df)
plt.show()
# Relationship between rating and cost of the order
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'rating', y = 'cost_of_the_order', data = df)
plt.show()
# plot the heatmap
col_list = ['cost_of_the_order', 'food_preparation_time', 'delivery_time']
plt.figure(figsize=(15, 7))
sns.heatmap(df[col_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
# convert rating column from object to integer
df_rated['rating'] = df_rated['rating'].astype('int')
# create a dataframe that contains the restaurant names with their rating counts
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
# get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
# filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# find the mean rating of the restaurants
df_mean_4.groupby(df_mean_4['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index()
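The rating count and the mean rating can also be obtained in a single pass with named aggregation. The following is a sketch under stated assumptions: `toy_rated` holds made-up ratings, and `min_count = 2` is a hypothetical threshold standing in for the 50 used above.

```python
import pandas as pd

# Hypothetical rated orders with placeholder restaurant names
toy_rated = pd.DataFrame({
    'restaurant_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [5, 4, 5, 3, 4, 5],
})

min_count = 2  # hypothetical threshold; the analysis above uses 50

# One groupby producing both the number of ratings and their mean
rating_summary = (
    toy_rated.groupby('restaurant_name')['rating']
    .agg(rating_count='count', mean_rating='mean')
    .reset_index()
)

# Keep restaurants with more than min_count ratings, best mean first
qualified = (
    rating_summary[rating_summary['rating_count'] > min_count]
    .sort_values('mean_rating', ascending=False)
)
print(qualified)
```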
# function to determine the net revenue
def compute_rev(x):
    if x > 20:
        return x*0.25
    elif x > 5:
        return x*0.15
    else:
        return x*0
df['Revenue'] = df['cost_of_the_order'].apply(compute_rev)
df.head()
# get the total revenue and print it
total_rev = df['Revenue'].sum()
print('The net revenue is around', round(total_rev, 2), 'dollars')
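The tiered commission can also be computed without a row-wise `apply`, for example with `numpy.select`, which picks the first matching condition for each element. This is a sketch on made-up costs (`toy_costs` is hypothetical), and the `compute_rev` copy is included only so the two approaches can be compared.

```python
import numpy as np
import pandas as pd

def compute_rev(x):
    # Same tiers as the function above: 25% above 20 dollars, 15% above 5 dollars
    if x > 20:
        return x * 0.25
    elif x > 5:
        return x * 0.15
    else:
        return 0.0

# Hypothetical order costs, chosen to sit on and around the tier boundaries
toy_costs = pd.Series([4.5, 5.0, 12.0, 20.0, 20.01, 30.0])

# Conditions are checked in order, so the 25% tier takes precedence
conditions = [toy_costs > 20, toy_costs > 5]
rates = [0.25, 0.15]
commission_rate = np.select(conditions, rates, default=0.0)

revenue_vectorized = toy_costs * commission_rate
revenue_apply = toy_costs.apply(compute_rev)
print(revenue_vectorized.round(4).tolist())
```

Both versions should give the same totals. The vectorized form mainly helps on larger datasets.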
# add a new column to the dataframe df to store the total delivery time
df['total_time'] = df['food_preparation_time'] + df['delivery_time']
# find the percentage of orders that have more than 60 minutes of total delivery time
print('The percentage of orders that have more than 60 minutes of total delivery time is',
      round(df[df['total_time'] > 60].shape[0] / df.shape[0] * 100, 2), '%')
# get the mean delivery time on weekdays and print it
print('The mean delivery time on weekdays is around',
      round(df[df['day_of_the_week'] == 'Weekday']['delivery_time'].mean()),
      'minutes')
# get the mean delivery time on weekends and print it
print('The mean delivery time on weekends is around',
      round(df[df['day_of_the_week'] == 'Weekend']['delivery_time'].mean()),
      'minutes')
The mean delivery time on weekdays is around 28 minutes whereas the mean delivery time on weekends is around 22 minutes.
This could be due to lower traffic volume on weekends.
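The weekday and weekend means can be produced together with a single `groupby`. The sketch below uses a made-up `toy_orders` frame (hypothetical values). On the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical orders, invented for illustration only
toy_orders = pd.DataFrame({
    'day_of_the_week': ['Weekday', 'Weekend', 'Weekend', 'Weekday', 'Weekend'],
    'delivery_time': [28, 22, 24, 30, 20],
})

# Mean delivery time for each day type in one step
mean_by_day = toy_orders.groupby('day_of_the_week')['delivery_time'].mean()
print(mean_by_day.round(2))
```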
FoodHub should integrate with restaurants serving American, Japanese, Italian and Chinese cuisines as these cuisines are very popular among FoodHub customers.
FoodHub should provide promotional offers to top-rated popular restaurants like Shake Shack that serve most of the orders.
As the order volume is high during the weekends, more delivery persons should be employed during the weekends to ensure timely delivery of the order. Weekend promotional offers should be given to the customers to increase the food orders during weekends.
Customer rating is a very important factor in gauging customer satisfaction. The company should investigate the reason behind the low count of ratings. They can redesign the rating page in the app and make it more interactive to encourage customers to rate their orders.
Around 11% of the total orders have more than 60 minutes of total delivery time. FoodHub should try to minimize such instances in order to avoid customer dissatisfaction. They can provide some reward to the punctual delivery persons.