Supervised Learning - Foundations Project: ReCell

Problem Statement

Business Context

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth \$52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description

The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.

  • brand_name: Name of manufacturing brand
  • os: OS on which the device runs
  • screen_size: Size of the screen in cm
  • 4g: Whether 4G is available or not
  • 5g: Whether 5G is available or not
  • main_camera_mp: Resolution of the rear camera in megapixels
  • selfie_camera_mp: Resolution of the front camera in megapixels
  • int_memory: Amount of internal memory (ROM) in GB
  • ram: Amount of RAM in GB
  • battery: Energy capacity of the device battery in mAh
  • weight: Weight of the device in grams
  • release_year: Year when the device model was released
  • days_used: Number of days the used/refurbished device has been used
  • normalized_new_price: Normalized price of a new device of the same model in euros
  • normalized_used_price: Normalized price of the used/refurbished device in euros

Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

Loading the dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/DSBA/Linear Regression/ReCell Project/used_device_data.csv') 
Mounted at /content/drive

Data Overview

  • Observations
  • Sanity checks
In [3]:
data.head() #First few rows of data
Out[3]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
0 Honor Android 14.50 yes no 13.0 5.0 64.0 3.0 3020.0 146.0 2020 127 4.307572 4.715100
1 Honor Android 17.30 yes yes 13.0 16.0 128.0 8.0 4300.0 213.0 2020 325 5.162097 5.519018
2 Honor Android 16.69 yes yes 13.0 8.0 128.0 8.0 4200.0 213.0 2020 162 5.111084 5.884631
3 Honor Android 25.50 yes yes 13.0 8.0 64.0 6.0 7250.0 480.0 2020 345 5.135387 5.630961
4 Honor Android 15.32 yes no 13.0 8.0 64.0 3.0 5000.0 185.0 2020 293 4.389995 4.947837
  • The dataset contains information about new/used phones and tablets features along with release year and prices.
  • Many phones seem to have OS as Android.
In [4]:
data.shape # shape of data
Out[4]:
(3454, 15)
  • The dataset contains information of 15 attributes and about 3454 features.
In [5]:
data.info() # data types of columns in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   brand_name             3454 non-null   object 
 1   os                     3454 non-null   object 
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object 
 4   5g                     3454 non-null   object 
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64  
 12  days_used              3454 non-null   int64  
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
  • There are 11 numeric (float and int type) and 4 string (object type) columns in the data.
  • The target variable is the normalized used price, which is of float type.
In [6]:
data.describe(include="all").T # statistical summary of dataset
Out[6]:
count unique top freq mean std min 25% 50% 75% max
brand_name 3454 34 Others 502 NaN NaN NaN NaN NaN NaN NaN
os 3454 4 Android 3214 NaN NaN NaN NaN NaN NaN NaN
screen_size 3454.0 NaN NaN NaN 13.713115 3.80528 5.08 12.7 12.83 15.34 30.71
4g 3454 2 yes 2335 NaN NaN NaN NaN NaN NaN NaN
5g 3454 2 no 3302 NaN NaN NaN NaN NaN NaN NaN
main_camera_mp 3275.0 NaN NaN NaN 9.460208 4.815461 0.08 5.0 8.0 13.0 48.0
selfie_camera_mp 3452.0 NaN NaN NaN 6.554229 6.970372 0.0 2.0 5.0 8.0 32.0
int_memory 3450.0 NaN NaN NaN 54.573099 84.972371 0.01 16.0 32.0 64.0 1024.0
ram 3450.0 NaN NaN NaN 4.036122 1.365105 0.02 4.0 4.0 4.0 12.0
battery 3448.0 NaN NaN NaN 3133.402697 1299.682844 500.0 2100.0 3000.0 4000.0 9720.0
weight 3447.0 NaN NaN NaN 182.751871 88.413228 69.0 142.0 160.0 185.0 855.0
release_year 3454.0 NaN NaN NaN 2015.965258 2.298455 2013.0 2014.0 2015.5 2018.0 2020.0
days_used 3454.0 NaN NaN NaN 674.869716 248.580166 91.0 533.5 690.5 868.75 1094.0
normalized_used_price 3454.0 NaN NaN NaN 4.364712 0.588914 1.536867 4.033931 4.405133 4.7557 6.619433
normalized_new_price 3454.0 NaN NaN NaN 5.233107 0.683637 2.901422 4.790342 5.245892 5.673718 7.847841

We can see that used price varys between 1.5 and 6.6.

The general price of used phone is 4.4

Android is the most occurring type of OS.

Sreen size in the data varies from ~5 to ~30

Energy capacity of the device battery in mAh varies from 500 to 9720.

In [7]:
data.duplicated().sum() #checking for duplicates
Out[7]:
0

There are no duplicate values in the data

In [8]:
data.isnull().sum() #checking for missing values
Out[8]:
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

There are missing values in many columns.

Exploratory Data Analysis (EDA)

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of normalized used device prices look like?
  2. What percentage of the used device market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
  4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
  6. A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the normalized price of a used device?
In [9]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  
        sharex=True, 
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    ) 
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    ) 
In [10]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  
        else:
            label = p.get_height()  

        x = p.get_x() + p.get_width()
        y = p.get_height()
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  

    plt.show()  

Univariate Analysis

Distribution of normalized used device prices

In [11]:
histogram_boxplot(data, "normalized_used_price") # histogram_boxplot for normalized_used_price

The distribution for used price of devices is left-skewed and prices for most devices vary between 3.5 to 5.5.
There are outliers in the column.

In [12]:
histogram_boxplot(data, "normalized_new_price") # histogram_boxplot for normalized_new_price

The distribution for new price of devices is almost fairly distributed and prices for most devices vary between 4.5 to 6.

There are a lot of outliers in the column.

In [13]:
histogram_boxplot(data, "screen_size") # histogram_boxplot for screen_size

The distribution for screen size is right-skewed and there are about 1300 devices with screen size appox. 13.

There are a lot of outliers in the column.

In [14]:
histogram_boxplot(data, "main_camera_mp") # histogram_boxplot for main_camera_mp

The distribution for main camera (mp) is right-skewed and there are about 1000 devices with mp of appox. 13.

There are only a few outliers in the column.

In [15]:
histogram_boxplot(data, "selfie_camera_mp") #histogram_boxplot for selfie_camera_mp

The distribution for selfie camera (mp) is right-skewed and there are 800 devices with mp of appox. 5.

There are a few outliers in the column.

In [16]:
histogram_boxplot(data, "int_memory") #histogram_boxplot for int_memory

The distribution for internal memory is extremely right-skewed and there are 1300 devices with memory of appox. 50.

There are a few outliers in the column.

In [17]:
histogram_boxplot(data, "ram") #histogram_boxplot for ram

There is not much distribution for ram and there are 2800 devices with ram of appox. 4.

There are outliers in the column.

In [18]:
histogram_boxplot(data, "weight") #histogram_boxplot for weight

The distribution for weight is right-skewed and weight for most devices varies between 100 and 200.

There are a lot of outliers in the column.

In [19]:
histogram_boxplot(data, "battery") #histogram_boxplot for battery

The distribution for battery is right-skewed and there are 550 devices with battery of appox. 3000.

There are a lot of outliers in the column.

A large battery often increases a device's weight, making it feel uncomfortable in the hands. Larger batteries (more than 4500 mah) are heavier and sold less.

In [20]:
histogram_boxplot(data, "days_used") #histogram_boxplot for days_used

The distribution for days used is left-skewed and there are about 260 devices with number of days used of 600.

There are no outliers in the column.

In [21]:
labeled_barplot(data, "brand_name", perc=True, n=10) # barplot for brand_name

Samsung seems to be the most popular brand name that is bought.

In [22]:
labeled_barplot(data, "os") # barplot for os

Android seems to be the most used os.

In [23]:
labeled_barplot(data, "4g") # barplot for 4g
In [24]:
labeled_barplot(data, "5g") # barplot for 5g

Most phones that were bought seem to have 5g network.

In [25]:
labeled_barplot(data, "release_year") # barplot for release_year

Most of the phones bought were released in year 2014.

Bivariate Analysis

Correlation Check

In [26]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove("release_year")

plt.figure(figsize=(15, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
In [27]:
plt.figure(figsize=(15, 5)) #dataframe for ram
sns.boxplot(data=data, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()

OnePlus devices seem to have highest Ram.

In [28]:
data_large_battery = data[data.battery > 4500]
data_large_battery.shape
Out[28]:
(341, 15)
In [29]:
plt.figure(figsize=(15, 5)) #dataframe for weight
sns.boxplot(x='brand_name', y='weight', data=data_large_battery) 
plt.xticks(rotation=90)
plt.show()

Samsung phones seem to have most weight since they have larger batteries.

In [30]:
data_large_screen = data[data.screen_size > 6 * 2.54]
data_large_screen.shape
Out[30]:
(1099, 15)
In [31]:
labeled_barplot(data_large_screen, "brand_name") #barplot for large screen

Huawei and Samsung devices seem to have larger screen size.

In [32]:
data_selfie_camera = data[data.selfie_camera_mp > 8]
data_selfie_camera.shape
Out[32]:
(655, 15)
In [33]:
labeled_barplot(data_selfie_camera, "brand_name") # barplot for selfie camera

Selfie camera mp seems to be better for Huawei and Vivo devices.

In [34]:
data_main_camera = data[data.main_camera_mp > 16]
data_main_camera.shape
Out[34]:
(94, 15)
In [35]:
labeled_barplot(data_main_camera, "brand_name") # barplot for rear camera 

Main Camera seems to be better for Sony and Motorola devices.

In [36]:
#Price of the phones over the years.
plt.figure(figsize=(12, 5))
sns.lineplot(data=data , x='release_year' , y='normalized_used_price') 
plt.show()
In [37]:
# Price of the phones offering 4G and 5G networks.
plt.figure(figsize=(10, 4))

plt.subplot(121)
sns.boxplot(data=data, x="4g", y="normalized_used_price")

plt.subplot(122)
sns.boxplot(data=data, x="5g", y="normalized_used_price")

plt.show()

Data Preprocessing

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

We will impute the missing values in the data by the column medians grouped by release_year and brand_name.

In [38]:
df = data.copy()
In [39]:
df.isnull().sum() # missing values
Out[39]:
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

Grouping data on release year and brand name and checking missing values.

In [40]:
cols_impute = [
    "main_camera_mp",
    "selfie_camera_mp",
    "int_memory",
    "ram",
    "battery",
    "weight",
]

for col in cols_impute:
    df[col] = df[col].fillna(
        value=df.groupby(['brand_name'])[col].transform("median")
    )   

df.isnull().sum() 
Out[40]:
brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64

Imputing the remaining missing values in the data by the column medians grouped by brand_name.

In [41]:
ols_impute = [
    "main_camera_mp",
    "selfie_camera_mp",
    "battery",
    "weight",
]

for col in cols_impute:
    df[col] = df[col].fillna(
        value=df.groupby(['brand_name'])[col].transform("median")
    ) 
df.isnull().sum() 
Out[41]:
brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64

Filling the remaining missing values in the main_camera_mp column by the column median.

In [42]:
df["main_camera_mp"] = df["main_camera_mp"].fillna(df["main_camera_mp"].median()) 
df.isnull().sum() 
Out[42]:
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64

Feature Engineering

A new column years_since_release (2021) from the release_year column was created. release_year column was dropped.

In [43]:
df["years_since_release"] = 2021 - df["release_year"]
df.drop("release_year", axis=1, inplace=True)
df["years_since_release"].describe()
Out[43]:
count    3454.000000
mean        5.034742
std         2.298455
min         1.000000
25%         3.000000
50%         5.500000
75%         7.000000
max         8.000000
Name: years_since_release, dtype: float64

EDA

  • It is a good idea to explore the data once again after manipulating it.

Outlier Check

In [44]:
num_cols = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 15))

for i, variable in enumerate(num_cols):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=df, x=variable)
    plt.tight_layout(pad=2)

plt.show()

There are a lot of outliers in the data

Data Preparation for modeling

In [45]:
X = df.drop(["normalized_used_price"], axis=1)
y = df["normalized_used_price"]

print(X.head())
print()
print(y.head())
  brand_name       os  screen_size   4g   5g  main_camera_mp  \
0      Honor  Android        14.50  yes   no            13.0   
1      Honor  Android        17.30  yes  yes            13.0   
2      Honor  Android        16.69  yes  yes            13.0   
3      Honor  Android        25.50  yes  yes            13.0   
4      Honor  Android        15.32  yes   no            13.0   

   selfie_camera_mp  int_memory  ram  battery  weight  days_used  \
0               5.0        64.0  3.0   3020.0   146.0        127   
1              16.0       128.0  8.0   4300.0   213.0        325   
2               8.0       128.0  8.0   4200.0   213.0        162   
3               8.0        64.0  6.0   7250.0   480.0        345   
4               8.0        64.0  3.0   5000.0   185.0        293   

   normalized_new_price  years_since_release  
0              4.715100                    1  
1              5.519018                    1  
2              5.884631                    1  
3              5.630961                    1  
4              4.947837                    1  

0    4.307572
1    5.162097
2    5.111084
3    5.135387
4    4.389995
Name: normalized_used_price, dtype: float64
In [46]:
X = sm.add_constant(X)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)
In [47]:
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
) 
X.head()
Out[47]:
const screen_size main_camera_mp selfie_camera_mp int_memory ram battery weight days_used normalized_new_price ... brand_name_Spice brand_name_Vivo brand_name_XOLO brand_name_Xiaomi brand_name_ZTE os_Others os_Windows os_iOS 4g_yes 5g_yes
0 1.0 14.50 13.0 5.0 64.0 3.0 3020.0 146.0 127 4.715100 ... 0 0 0 0 0 0 0 0 1 0
1 1.0 17.30 13.0 16.0 128.0 8.0 4300.0 213.0 325 5.519018 ... 0 0 0 0 0 0 0 0 1 1
2 1.0 16.69 13.0 8.0 128.0 8.0 4200.0 213.0 162 5.884631 ... 0 0 0 0 0 0 0 0 1 1
3 1.0 25.50 13.0 8.0 64.0 6.0 7250.0 480.0 345 5.630961 ... 0 0 0 0 0 0 0 0 1 1
4 1.0 15.32 13.0 8.0 64.0 3.0 5000.0 185.0 293 4.947837 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 49 columns

In [48]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 
In [49]:
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037

Model Building - Linear Regression

In [50]:
olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     268.8
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:54   Log-Likelihood:                 124.15
No. Observations:                   2417   AIC:                            -150.3
Df Residuals:                       2368   BIC:                             133.4
Df Model:                             48                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     1.3158      0.071     18.461      0.000       1.176       1.456
screen_size               0.0244      0.003      7.156      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.848      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     11.996      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.664      0.096   -2.07e-05       0.000
ram                       0.0232      0.005      4.515      0.000       0.013       0.033
battery               -1.686e-05   7.27e-06     -2.318      0.021   -3.11e-05    -2.6e-06
weight                    0.0010      0.000      7.488      0.000       0.001       0.001
days_used              4.196e-05   3.09e-05      1.360      0.174   -1.85e-05       0.000
normalized_new_price      0.4309      0.012     35.134      0.000       0.407       0.455
years_since_release      -0.0236      0.005     -5.189      0.000      -0.033      -0.015
brand_name_Alcatel        0.0154      0.048      0.324      0.746      -0.078       0.109
brand_name_Apple         -0.0032      0.147     -0.021      0.983      -0.292       0.285
brand_name_Asus           0.0150      0.048      0.313      0.754      -0.079       0.109
brand_name_BlackBerry    -0.0297      0.070     -0.423      0.672      -0.167       0.108
brand_name_Celkon        -0.0463      0.066     -0.699      0.484      -0.176       0.084
brand_name_Coolpad        0.0209      0.073      0.286      0.775      -0.122       0.164
brand_name_Gionee         0.0447      0.058      0.775      0.438      -0.068       0.158
brand_name_Google        -0.0327      0.085     -0.386      0.700      -0.199       0.133
brand_name_HTC           -0.0131      0.048     -0.271      0.786      -0.108       0.081
brand_name_Honor          0.0316      0.049      0.642      0.521      -0.065       0.128
brand_name_Huawei        -0.0022      0.044     -0.049      0.961      -0.089       0.085
brand_name_Infinix        0.1634      0.093      1.753      0.080      -0.019       0.346
brand_name_Karbonn        0.0943      0.067      1.406      0.160      -0.037       0.226
brand_name_LG            -0.0132      0.045     -0.292      0.771      -0.102       0.076
brand_name_Lava           0.0332      0.062      0.533      0.594      -0.089       0.155
brand_name_Lenovo         0.0453      0.045      1.003      0.316      -0.043       0.134
brand_name_Meizu         -0.0130      0.056     -0.232      0.817      -0.123       0.097
brand_name_Micromax      -0.0337      0.048     -0.704      0.481      -0.128       0.060
brand_name_Microsoft      0.0947      0.088      1.072      0.284      -0.079       0.268
brand_name_Motorola      -0.0113      0.050     -0.228      0.820      -0.109       0.086
brand_name_Nokia          0.0705      0.052      1.362      0.173      -0.031       0.172
brand_name_OnePlus        0.0707      0.077      0.913      0.361      -0.081       0.222
brand_name_Oppo           0.0124      0.048      0.259      0.796      -0.081       0.106
brand_name_Others        -0.0080      0.042     -0.191      0.849      -0.091       0.074
brand_name_Panasonic      0.0562      0.056      1.006      0.314      -0.053       0.166
brand_name_Realme         0.0319      0.062      0.517      0.605      -0.089       0.153
brand_name_Samsung       -0.0314      0.043     -0.726      0.468      -0.116       0.053
brand_name_Sony          -0.0616      0.050     -1.221      0.222      -0.161       0.037
brand_name_Spice         -0.0148      0.063     -0.234      0.815      -0.139       0.109
brand_name_Vivo          -0.0155      0.048     -0.320      0.749      -0.111       0.080
brand_name_XOLO           0.0151      0.055      0.276      0.783      -0.092       0.123
brand_name_Xiaomi         0.0868      0.048      1.804      0.071      -0.008       0.181
brand_name_ZTE           -0.0058      0.047     -0.122      0.903      -0.099       0.087
os_Others                -0.0519      0.033     -1.585      0.113      -0.116       0.012
os_Windows               -0.0202      0.045     -0.448      0.654      -0.109       0.068
os_iOS                   -0.0669      0.146     -0.457      0.648      -0.354       0.220
4g_yes                    0.0530      0.016      3.341      0.001       0.022       0.084
5g_yes                   -0.0721      0.031     -2.292      0.022      -0.134      -0.010
==============================================================================
Omnibus:                      223.220   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.514
Skew:                          -0.618   Prob(JB):                     1.79e-92
Kurtosis:                       4.633   Cond. No.                     1.78e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Adjusted R-squared reflects the fit of the model. The value is 0.842, which is good.

const coefficient is the Y-intercept. The value is 1.3158.

Model Performance Check

In [51]:
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100

def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  
    adjr2 = adj_r2_score(predictors, target, pred)  
    rmse = np.sqrt(mean_squared_error(target, pred))  
    mae = mean_absolute_error(target, pred) 
    mape = mape_score(target, pred)  

    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
In [52]:
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
Training Performance

Out[52]:
RMSE MAE R-squared Adj. R-squared MAPE
0 0.229856 0.180302 0.844924 0.841713 4.326213
In [73]:
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test) 
olsmodel_test_perf
Test Performance

Out[73]:
RMSE MAE R-squared Adj. R-squared MAPE
0 0.238482 0.184868 0.842315 0.834487 4.505694

The training R-squared is 0.844 and the model is not underfitting.

The train and test RMSE and MAE are comparable. The model is not overfitting.

MAE shows that model and predict used prices with mean error 0.18 on test data.

MAPE of 4.5 on test data means that we can predict 4.5% of prices.

Checking Linear Regression Assumptions

  • In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.
In [54]:
# Test for Multicollinearity
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
In [55]:
checking_vif(x_train) 
Out[55]:
feature VIF
0 const 227.678565
1 screen_size 7.677118
2 main_camera_mp 2.284685
3 selfie_camera_mp 2.810716
4 int_memory 1.364057
5 ram 2.255246
6 battery 4.081715
7 weight 6.397183
8 days_used 2.660002
9 normalized_new_price 3.119569
10 years_since_release 4.898360
11 brand_name_Alcatel 3.405674
12 brand_name_Apple 13.054832
13 brand_name_Asus 3.332003
14 brand_name_BlackBerry 1.632250
15 brand_name_Celkon 1.774820
16 brand_name_Coolpad 1.467981
17 brand_name_Gionee 1.951247
18 brand_name_Google 1.321771
19 brand_name_HTC 3.410252
20 brand_name_Honor 3.340621
21 brand_name_Huawei 5.983857
22 brand_name_Infinix 1.283814
23 brand_name_Karbonn 1.573683
24 brand_name_LG 4.849589
25 brand_name_Lava 1.711317
26 brand_name_Lenovo 4.558847
27 brand_name_Meizu 2.179607
28 brand_name_Micromax 3.363518
29 brand_name_Microsoft 1.869558
30 brand_name_Motorola 3.274455
31 brand_name_Nokia 3.473140
32 brand_name_OnePlus 1.437047
33 brand_name_Oppo 3.971065
34 brand_name_Others 9.710921
35 brand_name_Panasonic 2.105711
36 brand_name_Realme 1.946675
37 brand_name_Samsung 7.539832
38 brand_name_Sony 2.943127
39 brand_name_Spice 1.688868
40 brand_name_Vivo 3.651320
41 brand_name_XOLO 2.138074
42 brand_name_Xiaomi 3.719678
43 brand_name_ZTE 3.797527
44 os_Others 1.854134
45 os_Windows 1.595291
46 os_iOS 11.780766
47 4g_yes 2.468374
48 5g_yes 1.811042

There are multiple columns with high VIF so there is multicollinearity. Hight VIF columns need to be dropped.

In [56]:
# Removing multicollinearity
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)

    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    
    adj_r2 = []
    rmse = []

    
    for cols in high_vif_columns:
        
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]

 
        olsmodel = sm.OLS(target, train).fit()

        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))

    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)

    return temp
In [57]:
col_list = ["brand_name_apple", "os_iOS"] 
res = treating_multicollinearity(x_train, y_train, col_list) 
res
Out[57]:
col Adj. R-squared after_dropping col RMSE after dropping col
0 os_iOS 0.841833 0.232183
1 brand_name_apple 0.841780 0.232222
In [58]:
col_to_drop = "os_iOS" 
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)] 
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping  os_iOS
Out[58]:
feature VIF
0 const 226.478213
1 screen_size 7.608412
2 main_camera_mp 2.282023
3 selfie_camera_mp 2.800555
4 int_memory 1.364057
5 ram 2.240599
6 battery 4.077782
7 weight 6.378839
8 days_used 2.659713
9 normalized_new_price 3.119556
10 years_since_release 4.896950
11 brand_name_Alcatel 3.405652
12 brand_name_Apple 2.114750
13 brand_name_Asus 3.332000
14 brand_name_BlackBerry 1.629346
15 brand_name_Celkon 1.774814
16 brand_name_Coolpad 1.467903
17 brand_name_Gionee 1.951078
18 brand_name_Google 1.321728
19 brand_name_HTC 3.409740
20 brand_name_Honor 3.340510
21 brand_name_Huawei 5.983730
22 brand_name_Infinix 1.283812
23 brand_name_Karbonn 1.573272
24 brand_name_LG 4.849233
25 brand_name_Lava 1.710944
26 brand_name_Lenovo 4.558824
27 brand_name_Meizu 2.179601
28 brand_name_Micromax 3.363324
29 brand_name_Microsoft 1.869378
30 brand_name_Motorola 3.274439
31 brand_name_Nokia 3.469823
32 brand_name_OnePlus 1.436988
33 brand_name_Oppo 3.970869
34 brand_name_Others 9.710490
35 brand_name_Panasonic 2.105587
36 brand_name_Realme 1.946598
37 brand_name_Samsung 7.539660
38 brand_name_Sony 2.943075
39 brand_name_Spice 1.688091
40 brand_name_Vivo 3.651290
41 brand_name_XOLO 2.137802
42 brand_name_Xiaomi 3.719633
43 brand_name_ZTE 3.797137
44 os_Others 1.742135
45 os_Windows 1.593582
46 4g_yes 2.467353
47 5g_yes 1.805566
In [59]:
# Dropping high p values
predictors = x_train2.copy()  
cols = predictors.columns.tolist()

max_p_value = 1

while len(cols) > 0:
    x_train_aux = predictors[cols]
    model = sm.OLS(y_train, x_train_aux).fit()
    p_values = model.pvalues
    max_p_value = max(p_values)
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'battery', 'weight', 'normalized_new_price', 'years_since_release', 'brand_name_Lenovo', 'brand_name_Nokia', 'brand_name_Xiaomi', 'os_Others', '4g_yes']
In [60]:
x_train3 = x_train2[selected_features]  
x_test3 = x_test2[selected_features]  
In [61]:
olsmodel1 = sm.OLS(y_train, x_train3).fit() 
print(olsmodel1.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.842
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     988.3
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:56   Log-Likelihood:                 104.95
No. Observations:                   2417   AIC:                            -181.9
Df Residuals:                       2403   BIC:                            -100.8
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    1.3778      0.051     26.884      0.000       1.277       1.478
screen_size              0.0256      0.003      7.760      0.000       0.019       0.032
main_camera_mp           0.0212      0.001     15.318      0.000       0.018       0.024
selfie_camera_mp         0.0140      0.001     13.199      0.000       0.012       0.016
ram                      0.0177      0.004      4.012      0.000       0.009       0.026
battery              -1.505e-05    7.1e-06     -2.119      0.034    -2.9e-05   -1.12e-06
weight                   0.0009      0.000      7.183      0.000       0.001       0.001
normalized_new_price     0.4220      0.011     39.111      0.000       0.401       0.443
years_since_release     -0.0199      0.004     -5.511      0.000      -0.027      -0.013
brand_name_Lenovo        0.0491      0.021      2.287      0.022       0.007       0.091
brand_name_Nokia         0.0665      0.031      2.175      0.030       0.007       0.127
brand_name_Xiaomi        0.0892      0.026      3.496      0.000       0.039       0.139
os_Others               -0.0710      0.030     -2.378      0.017      -0.129      -0.012
4g_yes                   0.0501      0.015      3.370      0.001       0.021       0.079
==============================================================================
Omnibus:                      232.465   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              458.286
Skew:                          -0.626   Prob(JB):                    3.05e-100
Kurtosis:                       4.727   Cond. No.                     3.85e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Adjusted R squared value is same (0.842). Dropping columns seems to not have worked.

Now no feature has p-value greater than 0.05, so we'll consider the features in x_train2 as the final set of predictor variables and olsmod1 as the final model to move forward with

Adjusted R-squared is 0.842, Our model is able to explain ~84% of the variance.

RMSE and MAE values are comparable for train and test sets, indicating that the model is not overfitting

In [62]:
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train3, y_train) 
olsmodel1_train_perf
Training Performance

Out[62]:
RMSE MAE R-squared Adj. R-squared MAPE
0 0.231688 0.181745 0.842441 0.841523 4.359443
In [63]:
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test3, y_test) 
olsmodel1_test_perf
Test Performance

Out[63]:
RMSE MAE R-squared Adj. R-squared MAPE
0 0.236891 0.183559 0.844412 0.84228 4.469796

TEST FOR LINEARITY AND INDEPENDENCE

In [64]:
df_pred = pd.DataFrame()

df_pred["Actual Values"] = y_train 
df_pred["Fitted Values"] = olsmodel1.fittedvalues  
df_pred["Residuals"] = olsmodel1.resid  

df_pred.head()
Out[64]:
Actual Values Fitted Values Residuals
3026 4.087488 3.855295 0.232193
1525 4.448399 4.628844 -0.180445
1128 4.315353 4.280692 0.034660
3003 4.282068 4.206360 0.075709
2907 4.456438 4.470310 -0.013872
In [65]:
sns.residplot(
    data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()

The scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values).

If there exist any pattern in this plot, we consider it as signs of non-linearity in the data and a pattern means that the model doesn't capture non-linear effects.

We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.

TEST FOR NORMALITY

In [66]:
sns.histplot(data=df_pred, x="Residuals", kde=True) 
plt.title("Normality of residuals")
plt.show()

The histogram of residuals have almost bell shape.

In [67]:
import pylab
import scipy.stats as stats

stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab) 
plt.show()

The residuals follow a straight line in the middle.

In [68]:
stats.shapiro(df_pred["Residuals"]) # Shapiro-Wilks Test
Out[68]:
(0.9690815806388855, 2.105690889523832e-22)

Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test. The residuals are not normal. However, as an approximation, we can accept this distribution as close to being normal. So, the assumption is SATISFIED.

TEST FOR HOMOSCEDASTICITY

In [69]:
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
Out[69]:
[('F statistic', 1.043112810782595), ('p-value', 0.23292738306791388)]

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.

Final Model

In [70]:
olsmodel_final = sm.OLS(y_train, x_train3).fit()
print(olsmodel_final.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.842
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     988.3
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:58   Log-Likelihood:                 104.95
No. Observations:                   2417   AIC:                            -181.9
Df Residuals:                       2403   BIC:                            -100.8
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    1.3778      0.051     26.884      0.000       1.277       1.478
screen_size              0.0256      0.003      7.760      0.000       0.019       0.032
main_camera_mp           0.0212      0.001     15.318      0.000       0.018       0.024
selfie_camera_mp         0.0140      0.001     13.199      0.000       0.012       0.016
ram                      0.0177      0.004      4.012      0.000       0.009       0.026
battery              -1.505e-05    7.1e-06     -2.119      0.034    -2.9e-05   -1.12e-06
weight                   0.0009      0.000      7.183      0.000       0.001       0.001
normalized_new_price     0.4220      0.011     39.111      0.000       0.401       0.443
years_since_release     -0.0199      0.004     -5.511      0.000      -0.027      -0.013
brand_name_Lenovo        0.0491      0.021      2.287      0.022       0.007       0.091
brand_name_Nokia         0.0665      0.031      2.175      0.030       0.007       0.127
brand_name_Xiaomi        0.0892      0.026      3.496      0.000       0.039       0.139
os_Others               -0.0710      0.030     -2.378      0.017      -0.129      -0.012
4g_yes                   0.0501      0.015      3.370      0.001       0.021       0.079
==============================================================================
Omnibus:                      232.465   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              458.286
Skew:                          -0.626   Prob(JB):                    3.05e-100
Kurtosis:                       4.727   Cond. No.                     3.85e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [71]:
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel_final, x_train3, y_train) 
olsmodel_final_train_perf
Training Performance

Out[71]:
RMSE MAE R-squared Adj. R-squared MAPE
0 0.231688 0.181745 0.842441 0.841523 4.359443
In [72]:
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test3, y_test) 
Test Performance

Actionable Insights and Recommendations

The model is able to explain ~84% of the variation in the data and within 4.5% of the prices on the test data, which is good

This indicates that the model is good for prediction as well as inference purposes.

If the screen size of the device increases by one unit, then its price increases by 0.0256 units, all other variables held constant.

If the mp for main camera increases by one unit, then its price increases by 0.0212 units, all other variables held constant.

If the mp for selfie camera increases by one unit, then its price increases by 0.0140 units, all other variables held constant.

If the battery size increases, price reduces.

As screen size contributes to the price, company needs to enhance marketing activities for screen size.

The company can also advertize main camera and selfie camera features which can help with more purchase of the devices.