Buying and selling used phones and tablets used to happen on just a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7 billion by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money when purchasing a device. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon and Amazon, provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and supports recycling and waste reduction. The impact of the COVID-19 outbreak may further boost this segment, as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
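The columns used in this notebook are listed below; these descriptions are inferred from how each column is used in the analysis and may differ slightly from the original data dictionary.
brand_name: name of the manufacturing brand
os: operating system the device runs on
screen_size: size of the screen (in cm)
4g: whether 4G is available or not
5g: whether 5G is available or not
main_camera_mp: resolution of the rear (main) camera in megapixels
selfie_camera_mp: resolution of the front (selfie) camera in megapixels
int_memory: amount of internal memory/storage (in GB)
ram: amount of RAM (in GB)
battery: energy capacity of the device battery (in mAh)
weight: weight of the device (in grams)
release_year: year the device model was released
days_used: number of days the used/refurbished device has been used
normalized_new_price: normalized price of a new device of the same model
normalized_used_price: normalized price of the used/refurbished device (target variable)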
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/DSBA/Linear Regression/ReCell Project/used_device_data.csv')
data.head() #First few rows of data
data.shape # shape of data
data.info() # data types of columns in the dataset
data.describe(include="all").T # statistical summary of dataset
We can see that the normalized used price varies between ~1.5 and ~6.6.
The average normalized used price is ~4.4.
Android is the most common OS in the data.
Screen size in the data varies from ~5 to ~30 cm.
The energy capacity of the device battery varies from 500 to 9720 mAh.
data.duplicated().sum() #checking for duplicates
There are no duplicate values in the data
data.isnull().sum() #checking for missing values
There are missing values in many columns.
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
)
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
)
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
)
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature])
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
)
else:
label = p.get_height()
        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
y = p.get_height()
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
)
plt.show()
Univariate Analysis
Distribution of normalized used device prices
histogram_boxplot(data, "normalized_used_price") # histogram_boxplot for normalized_used_price
The distribution of the used device price is left-skewed, and prices for most devices lie between 3.5 and 5.5.
There are outliers in the column.
histogram_boxplot(data, "normalized_new_price") # histogram_boxplot for normalized_new_price
The distribution of the new device price is fairly symmetric, and prices for most devices lie between 4.5 and 6.
There are a lot of outliers in the column.
histogram_boxplot(data, "screen_size") # histogram_boxplot for screen_size
The distribution of screen size is right-skewed; about 1,300 devices have a screen size of approximately 13 cm.
There are a lot of outliers in the column.
histogram_boxplot(data, "main_camera_mp") # histogram_boxplot for main_camera_mp
The distribution of main camera resolution (MP) is right-skewed; about 1,000 devices have a resolution of approximately 13 MP.
There are only a few outliers in the column.
histogram_boxplot(data, "selfie_camera_mp") #histogram_boxplot for selfie_camera_mp
The distribution of selfie camera resolution (MP) is right-skewed; about 800 devices have a resolution of approximately 5 MP.
There are a few outliers in the column.
histogram_boxplot(data, "int_memory") #histogram_boxplot for int_memory
The distribution of internal memory is heavily right-skewed; about 1,300 devices have approximately 50 GB of internal memory.
There are a few outliers in the column.
histogram_boxplot(data, "ram") #histogram_boxplot for ram
The distribution of RAM shows little variation; about 2,800 devices have approximately 4 GB of RAM.
There are outliers in the column.
histogram_boxplot(data, "weight") #histogram_boxplot for weight
The distribution of weight is right-skewed, and the weight of most devices lies between 100 and 200 grams.
There are a lot of outliers in the column.
histogram_boxplot(data, "battery") #histogram_boxplot for battery
The distribution of battery capacity is right-skewed; about 550 devices have a capacity of approximately 3000 mAh.
There are a lot of outliers in the column.
A large battery often increases a device's weight, making it less comfortable to hold. Larger batteries (more than 4500 mAh) are heavier and appear to sell less.
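As a quick check of this claim, a scatter plot of battery capacity against weight can be drawn; a minimal sketch (it reuses the imports and the data DataFrame loaded above):
# Scatter plot of battery capacity (mAh) vs device weight
plt.figure(figsize=(10, 5))
sns.scatterplot(data=data, x="battery", y="weight")
plt.show()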
histogram_boxplot(data, "days_used") #histogram_boxplot for days_used
The distribution of days used is left-skewed; about 260 devices have been used for around 600 days.
There are no outliers in the column.
labeled_barplot(data, "brand_name", perc=True, n=10) # barplot for brand_name
Samsung seems to be the most common brand among the devices.
labeled_barplot(data, "os") # barplot for os
Android seems to be the most used os.
labeled_barplot(data, "4g") # barplot for 4g
labeled_barplot(data, "5g") # barplot for 5g
Most devices support 4G, while relatively few support 5G.
labeled_barplot(data, "release_year") # barplot for release_year
Most of the devices in the data were released in 2014.
Bivariate Analysis
Correlation Check
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
plt.figure(figsize=(15, 5)) # boxplot of ram across brands
sns.boxplot(data=data, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()
OnePlus devices seem to have the highest RAM.
data_large_battery = data[data.battery > 4500]
data_large_battery.shape
plt.figure(figsize=(15, 5)) # boxplot of weight across brands (devices with large batteries)
sns.boxplot(x='brand_name', y='weight', data=data_large_battery)
plt.xticks(rotation=90)
plt.show()
Among devices with batteries larger than 4500 mAh, Samsung phones appear to be the heaviest.
data_large_screen = data[data.screen_size > 6 * 2.54]
data_large_screen.shape
labeled_barplot(data_large_screen, "brand_name") #barplot for large screen
Huawei and Samsung offer the most devices with screens larger than 6 inches.
data_selfie_camera = data[data.selfie_camera_mp > 8]
data_selfie_camera.shape
labeled_barplot(data_selfie_camera, "brand_name") # barplot for selfie camera
Huawei and Vivo devices tend to have higher selfie camera resolutions.
data_main_camera = data[data.main_camera_mp > 16]
data_main_camera.shape
labeled_barplot(data_main_camera, "brand_name") # barplot for rear camera
Sony and Motorola devices tend to have higher main (rear) camera resolutions.
#Price of the phones over the years.
plt.figure(figsize=(12, 5))
sns.lineplot(data=data , x='release_year' , y='normalized_used_price')
plt.show()
# Price of the phones offering 4G and 5G networks.
plt.figure(figsize=(10, 4))
plt.subplot(121)
sns.boxplot(data=data, x="4g", y="normalized_used_price")
plt.subplot(122)
sns.boxplot(data=data, x="5g", y="normalized_used_price")
plt.show()
We will impute the missing values in the data by the column medians grouped by release_year and brand_name.
df = data.copy()
df.isnull().sum() # missing values
Grouping data on release year and brand name and checking missing values.
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
for col in cols_impute:
    df[col] = df[col].fillna(
        value=df.groupby(['release_year', 'brand_name'])[col].transform("median")
    )
df.isnull().sum()
Imputing the remaining missing values in the data by the column medians grouped by brand_name.
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"battery",
"weight",
]
for col in cols_impute:
df[col] = df[col].fillna(
value=df.groupby(['brand_name'])[col].transform("median")
)
df.isnull().sum()
Filling the remaining missing values in the main_camera_mp column by the column median.
df["main_camera_mp"] = df["main_camera_mp"].fillna(df["main_camera_mp"].median())
df.isnull().sum()
Feature Engineering
A new column years_since_release was created from the release_year column, using 2021 (the year of data collection) as the reference year. The release_year column was then dropped.
df["years_since_release"] = 2021 - df["release_year"]
df.drop("release_year", axis=1, inplace=True)
df["years_since_release"].describe()
Outlier Check
num_cols = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 15))
for i, variable in enumerate(num_cols):
plt.subplot(4, 3, i + 1)
sns.boxplot(data=df, x=variable)
plt.tight_layout(pad=2)
plt.show()
There are a lot of outliers in the data
Data Preparation for modeling
X = df.drop(["normalized_used_price"], axis=1)
y = df["normalized_used_price"]
print(X.head())
print()
print(y.head())
X = sm.add_constant(X)
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X.head()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())
Adjusted R-squared reflects the fit of the model after accounting for the number of predictors; its value of 0.842 is good.
The const coefficient is the Y-intercept of the regression line; its value is 1.3158.
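For reference, the adjusted R-squared penalizes R-squared for the number of predictors k relative to the number of observations n; this is the same formula implemented in the adj_r2_score helper below:
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$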
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
pred = model.predict(predictors)
r2 = r2_score(target, pred)
adjr2 = adj_r2_score(predictors, target, pred)
rmse = np.sqrt(mean_squared_error(target, pred))
mae = mean_absolute_error(target, pred)
mape = mape_score(target, pred)
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test)
olsmodel_test_perf
The training R-squared is 0.844, so the model is not underfitting.
The train and test RMSE and MAE values are comparable, so the model is not overfitting.
The MAE shows that the model predicts used prices with a mean absolute error of ~0.18 units on the test data.
A MAPE of ~4.5 on the test data means the predictions are, on average, off by about 4.5% of the actual price (see the formula below).
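For clarity, the MAPE reported by mape_score is
$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
i.e., the average absolute percentage deviation of the predictions from the actual values.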
# Test for Multicollinearity
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train)
There are multiple columns with high VIF, indicating multicollinearity. The high-VIF columns need to be dropped.
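As a reminder, the VIF of predictor j is computed from the R-squared obtained by regressing that predictor on all the other predictors:
$$VIF_j = \frac{1}{1 - R_j^2}$$
Values above roughly 5 (or 10, depending on the threshold chosen) are commonly taken to indicate problematic multicollinearity.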
# Removing multicollinearity
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
adj_r2 = []
rmse = []
for cols in high_vif_columns:
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
olsmodel = sm.OLS(target, train).fit()
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
col_list = ["brand_name_apple", "os_iOS"]
res = treating_multicollinearity(x_train, y_train, col_list)
res
col_to_drop = "os_iOS"
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
# Dropping high p values
predictors = x_train2.copy()
cols = predictors.columns.tolist()
max_p_value = 1
while len(cols) > 0:
x_train_aux = predictors[cols]
model = sm.OLS(y_train, x_train_aux).fit()
p_values = model.pvalues
max_p_value = max(p_values)
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
x_train3 = x_train2[selected_features]
x_test3 = x_test2[selected_features]
olsmodel1 = sm.OLS(y_train, x_train3).fit()
print(olsmodel1.summary())
The adjusted R-squared value is unchanged (0.842), so dropping these columns has not hurt the model's performance.
Now no feature has a p-value greater than 0.05, so we'll consider the features in x_train3 as the final set of predictor variables and olsmodel1 as the final model to move forward with.
The adjusted R-squared is 0.842, i.e., the model is able to explain ~84% of the variance.
RMSE and MAE values are comparable for the train and test sets, indicating that the model is not overfitting.
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train3, y_train)
olsmodel1_train_perf
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test3, y_test)
olsmodel1_test_perf
TEST FOR LINEARITY AND INDEPENDENCE
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train
df_pred["Fitted Values"] = olsmodel1.fittedvalues
df_pred["Residuals"] = olsmodel1.resid
df_pred.head()
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
The scatter plot shows the residuals (errors) against the fitted values (predictions).
If there were any pattern in this plot, it would be a sign of non-linearity in the data, i.e., of non-linear effects the model fails to capture.
We see no pattern in the plot above, so the assumptions of linearity and independence are satisfied.
TEST FOR NORMALITY
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
The histogram of residuals is roughly bell-shaped.
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
On the Q-Q plot, the residuals mostly follow the straight line.
stats.shapiro(df_pred["Residuals"]) # Shapiro-Wilks Test
Since the p-value < 0.05, the Shapiro-Wilk test rejects the null hypothesis that the residuals are normally distributed. However, the histogram and Q-Q plot show that the distribution is close to normal, so as an approximation we can treat the normality assumption as satisfied.
TEST FOR HOMOSCEDASTICITY
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
Since the p-value > 0.05, we fail to reject the null hypothesis of homoscedasticity (Goldfeld-Quandt test), so the residuals can be considered homoscedastic. This assumption is satisfied.
olsmodel_final = sm.OLS(y_train, x_train3).fit()
print(olsmodel_final.summary())
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel_final, x_train3, y_train)
olsmodel_final_train_perf
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test3, y_test)
olsmodel_final_test_perf
The model is able to explain ~84% of the variation in the data, and its test predictions are, on average, within ~4.5% of the actual prices, which is good.
This indicates that the model is suitable for both prediction and inference.
If the screen size of the device increases by one unit (cm), its normalized used price increases by 0.0256 units, all other variables held constant.
If the main camera resolution increases by one megapixel, the price increases by 0.0212 units, all other variables held constant.
If the selfie camera resolution increases by one megapixel, the price increases by 0.0140 units, all other variables held constant.
An increase in battery capacity is associated with a decrease in price, all other variables held constant; the coefficients behind these statements can be listed as shown below.
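These per-unit effects are read off the coefficients of the final model; a minimal sketch of how to list them, using the olsmodel_final object fitted above:
# List the fitted coefficients, sorted from the most positive to the most negative effect on price
print(olsmodel_final.params.sort_values(ascending=False))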
Since screen size contributes positively to the price, the company should emphasize screen size in its marketing activities.
The company can also advertise the main and selfie camera features, which may help drive more device purchases.