Supervised Learning - Foundations Project: ReCell¶

Problem Statement¶

Business Context¶

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth \$52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective¶

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description¶

The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.

brand_name: Name of manufacturing brand
os: OS on which the device runs
screen_size: Size of the screen in cm
4g: Whether 4G is available or not
5g: Whether 5G is available or not
main_camera_mp: Resolution of the rear camera in megapixels
selfie_camera_mp: Resolution of the front camera in megapixels
int_memory: Amount of internal memory (ROM) in GB
ram: Amount of RAM in GB
battery: Energy capacity of the device battery in mAh
weight: Weight of the device in grams
release_year: Year when the device model was released
days_used: Number of days the used/refurbished device has been used
normalized_new_price: Normalized price of a new device of the same model in euros
normalized_used_price: Normalized price of the used/refurbished device in euros

Importing necessary libraries¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

Loading the dataset¶

from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/DSBA/Linear Regression/ReCell Project/used_device_data.csv')

Mounted at /content/drive

Data Overview¶

Observations
Sanity checks

data.head() #First few rows of data

The dataset contains information about new/used phones and tablets features along with release year and prices.
Many phones seem to have OS as Android.

data.shape # shape of data

(3454, 15)

The dataset contains information of 15 attributes and about 3454 features.

data.info() # data types of columns in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   brand_name             3454 non-null   object 
 1   os                     3454 non-null   object 
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object 
 4   5g                     3454 non-null   object 
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64  
 12  days_used              3454 non-null   int64  
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB

There are 11 numeric (float and int type) and 4 string (object type) columns in the data.
The target variable is the normalized used price, which is of float type.

data.describe(include="all").T # statistical summary of dataset

We can see that used price varys between 1.5 and 6.6.

The general price of used phone is 4.4

Android is the most occurring type of OS.

Sreen size in the data varies from ~5 to ~30

Energy capacity of the device battery in mAh varies from 500 to 9720.

data.duplicated().sum() #checking for duplicates

0

There are no duplicate values in the data

data.isnull().sum() #checking for missing values

brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

There are missing values in many columns.

Exploratory Data Analysis (EDA)¶

EDA is an important part of any project involving data.
It is important to investigate and understand the data better before building a model with it.
A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

What does the distribution of normalized used device prices look like?
What percentage of the used device market is dominated by Android devices?
The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
Which attributes are highly correlated with the normalized price of a used device?

def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  
        sharex=True, 
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    ) 
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  
        else:
            label = p.get_height()  

        x = p.get_x() + p.get_width()
        y = p.get_height()
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  

    plt.show()

Univariate Analysis

Distribution of normalized used device prices

histogram_boxplot(data, "normalized_used_price") # histogram_boxplot for normalized_used_price

The distribution for used price of devices is left-skewed and prices for most devices vary between 3.5 to 5.5.
There are outliers in the column.

histogram_boxplot(data, "normalized_new_price") # histogram_boxplot for normalized_new_price

The distribution for new price of devices is almost fairly distributed and prices for most devices vary between 4.5 to 6.

There are a lot of outliers in the column.

histogram_boxplot(data, "screen_size") # histogram_boxplot for screen_size

The distribution for screen size is right-skewed and there are about 1300 devices with screen size appox. 13.

There are a lot of outliers in the column.

histogram_boxplot(data, "main_camera_mp") # histogram_boxplot for main_camera_mp

The distribution for main camera (mp) is right-skewed and there are about 1000 devices with mp of appox. 13.

There are only a few outliers in the column.

histogram_boxplot(data, "selfie_camera_mp") #histogram_boxplot for selfie_camera_mp

The distribution for selfie camera (mp) is right-skewed and there are 800 devices with mp of appox. 5.

There are a few outliers in the column.

histogram_boxplot(data, "int_memory") #histogram_boxplot for int_memory

The distribution for internal memory is extremely right-skewed and there are 1300 devices with memory of appox. 50.

There are a few outliers in the column.

histogram_boxplot(data, "ram") #histogram_boxplot for ram

There is not much distribution for ram and there are 2800 devices with ram of appox. 4.

There are outliers in the column.

histogram_boxplot(data, "weight") #histogram_boxplot for weight

The distribution for weight is right-skewed and weight for most devices varies between 100 and 200.

There are a lot of outliers in the column.

histogram_boxplot(data, "battery") #histogram_boxplot for battery

The distribution for battery is right-skewed and there are 550 devices with battery of appox. 3000.

There are a lot of outliers in the column.

A large battery often increases a device's weight, making it feel uncomfortable in the hands. Larger batteries (more than 4500 mah) are heavier and sold less.

histogram_boxplot(data, "days_used") #histogram_boxplot for days_used

The distribution for days used is left-skewed and there are about 260 devices with number of days used of 600.

There are no outliers in the column.

labeled_barplot(data, "brand_name", perc=True, n=10) # barplot for brand_name

Samsung seems to be the most popular brand name that is bought.

labeled_barplot(data, "os") # barplot for os

Android seems to be the most used os.

labeled_barplot(data, "4g") # barplot for 4g

labeled_barplot(data, "5g") # barplot for 5g

Most phones that were bought seem to have 5g network.

labeled_barplot(data, "release_year") # barplot for release_year

Most of the phones bought were released in year 2014.

Bivariate Analysis

Correlation Check

cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove("release_year")

plt.figure(figsize=(15, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

plt.figure(figsize=(15, 5)) #dataframe for ram
sns.boxplot(data=data, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()

OnePlus devices seem to have highest Ram.

data_large_battery = data[data.battery > 4500]
data_large_battery.shape

(341, 15)

plt.figure(figsize=(15, 5)) #dataframe for weight
sns.boxplot(x='brand_name', y='weight', data=data_large_battery) 
plt.xticks(rotation=90)
plt.show()

Samsung phones seem to have most weight since they have larger batteries.

data_large_screen = data[data.screen_size > 6 * 2.54]
data_large_screen.shape

(1099, 15)

labeled_barplot(data_large_screen, "brand_name") #barplot for large screen

Huawei and Samsung devices seem to have larger screen size.

data_selfie_camera = data[data.selfie_camera_mp > 8]
data_selfie_camera.shape

(655, 15)

labeled_barplot(data_selfie_camera, "brand_name") # barplot for selfie camera

Selfie camera mp seems to be better for Huawei and Vivo devices.

data_main_camera = data[data.main_camera_mp > 16]
data_main_camera.shape

(94, 15)

labeled_barplot(data_main_camera, "brand_name") # barplot for rear camera

Main Camera seems to be better for Sony and Motorola devices.

#Price of the phones over the years.
plt.figure(figsize=(12, 5))
sns.lineplot(data=data , x='release_year' , y='normalized_used_price') 
plt.show()

# Price of the phones offering 4G and 5G networks.
plt.figure(figsize=(10, 4))

plt.subplot(121)
sns.boxplot(data=data, x="4g", y="normalized_used_price")

plt.subplot(122)
sns.boxplot(data=data, x="5g", y="normalized_used_price")

plt.show()

Data Preprocessing¶

Missing value treatment
Feature engineering (if needed)
Outlier detection and treatment (if needed)
Preparing data for modeling
Any other preprocessing steps (if needed)

We will impute the missing values in the data by the column medians grouped by release_year and brand_name.

df = data.copy()

df.isnull().sum() # missing values

brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

Grouping data on release year and brand name and checking missing values.

cols_impute = [
    "main_camera_mp",
    "selfie_camera_mp",
    "int_memory",
    "ram",
    "battery",
    "weight",
]

for col in cols_impute:
    df[col] = df[col].fillna(
        value=df.groupby(['brand_name'])[col].transform("median")
    )   

df.isnull().sum()

brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64

Imputing the remaining missing values in the data by the column medians grouped by brand_name.

ols_impute = [
    "main_camera_mp",
    "selfie_camera_mp",
    "battery",
    "weight",
]

for col in cols_impute:
    df[col] = df[col].fillna(
        value=df.groupby(['brand_name'])[col].transform("median")
    ) 
df.isnull().sum()

brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64

Filling the remaining missing values in the main_camera_mp column by the column median.

df["main_camera_mp"] = df["main_camera_mp"].fillna(df["main_camera_mp"].median()) 
df.isnull().sum()

brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64

Feature Engineering

A new column years_since_release (2021) from the release_year column was created. release_year column was dropped.

df["years_since_release"] = 2021 - df["release_year"]
df.drop("release_year", axis=1, inplace=True)
df["years_since_release"].describe()

count    3454.000000
mean        5.034742
std         2.298455
min         1.000000
25%         3.000000
50%         5.500000
75%         7.000000
max         8.000000
Name: years_since_release, dtype: float64

EDA¶

It is a good idea to explore the data once again after manipulating it.

Outlier Check

num_cols = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 15))

for i, variable in enumerate(num_cols):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=df, x=variable)
    plt.tight_layout(pad=2)

plt.show()

There are a lot of outliers in the data

Data Preparation for modeling

X = df.drop(["normalized_used_price"], axis=1)
y = df["normalized_used_price"]

print(X.head())
print()
print(y.head())

  brand_name       os  screen_size   4g   5g  main_camera_mp  \
0      Honor  Android        14.50  yes   no            13.0   
1      Honor  Android        17.30  yes  yes            13.0   
2      Honor  Android        16.69  yes  yes            13.0   
3      Honor  Android        25.50  yes  yes            13.0   
4      Honor  Android        15.32  yes   no            13.0   

   selfie_camera_mp  int_memory  ram  battery  weight  days_used  \
0               5.0        64.0  3.0   3020.0   146.0        127   
1              16.0       128.0  8.0   4300.0   213.0        325   
2               8.0       128.0  8.0   4200.0   213.0        162   
3               8.0        64.0  6.0   7250.0   480.0        345   
4               8.0        64.0  3.0   5000.0   185.0        293   

   normalized_new_price  years_since_release  
0              4.715100                    1  
1              5.519018                    1  
2              5.884631                    1  
3              5.630961                    1  
4              4.947837                    1  

0    4.307572
1    5.162097
2    5.111084
3    5.135387
4    4.389995
Name: normalized_used_price, dtype: float64

X = sm.add_constant(X)

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)

X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
) 
X.head()

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])

Number of rows in train data = 2417
Number of rows in test data = 1037

Model Building - Linear Regression¶

olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())

                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     268.8
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:54   Log-Likelihood:                 124.15
No. Observations:                   2417   AIC:                            -150.3
Df Residuals:                       2368   BIC:                             133.4
Df Model:                             48                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     1.3158      0.071     18.461      0.000       1.176       1.456
screen_size               0.0244      0.003      7.156      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.848      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     11.996      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.664      0.096   -2.07e-05       0.000
ram                       0.0232      0.005      4.515      0.000       0.013       0.033
battery               -1.686e-05   7.27e-06     -2.318      0.021   -3.11e-05    -2.6e-06
weight                    0.0010      0.000      7.488      0.000       0.001       0.001
days_used              4.196e-05   3.09e-05      1.360      0.174   -1.85e-05       0.000
normalized_new_price      0.4309      0.012     35.134      0.000       0.407       0.455
years_since_release      -0.0236      0.005     -5.189      0.000      -0.033      -0.015
brand_name_Alcatel        0.0154      0.048      0.324      0.746      -0.078       0.109
brand_name_Apple         -0.0032      0.147     -0.021      0.983      -0.292       0.285
brand_name_Asus           0.0150      0.048      0.313      0.754      -0.079       0.109
brand_name_BlackBerry    -0.0297      0.070     -0.423      0.672      -0.167       0.108
brand_name_Celkon        -0.0463      0.066     -0.699      0.484      -0.176       0.084
brand_name_Coolpad        0.0209      0.073      0.286      0.775      -0.122       0.164
brand_name_Gionee         0.0447      0.058      0.775      0.438      -0.068       0.158
brand_name_Google        -0.0327      0.085     -0.386      0.700      -0.199       0.133
brand_name_HTC           -0.0131      0.048     -0.271      0.786      -0.108       0.081
brand_name_Honor          0.0316      0.049      0.642      0.521      -0.065       0.128
brand_name_Huawei        -0.0022      0.044     -0.049      0.961      -0.089       0.085
brand_name_Infinix        0.1634      0.093      1.753      0.080      -0.019       0.346
brand_name_Karbonn        0.0943      0.067      1.406      0.160      -0.037       0.226
brand_name_LG            -0.0132      0.045     -0.292      0.771      -0.102       0.076
brand_name_Lava           0.0332      0.062      0.533      0.594      -0.089       0.155
brand_name_Lenovo         0.0453      0.045      1.003      0.316      -0.043       0.134
brand_name_Meizu         -0.0130      0.056     -0.232      0.817      -0.123       0.097
brand_name_Micromax      -0.0337      0.048     -0.704      0.481      -0.128       0.060
brand_name_Microsoft      0.0947      0.088      1.072      0.284      -0.079       0.268
brand_name_Motorola      -0.0113      0.050     -0.228      0.820      -0.109       0.086
brand_name_Nokia          0.0705      0.052      1.362      0.173      -0.031       0.172
brand_name_OnePlus        0.0707      0.077      0.913      0.361      -0.081       0.222
brand_name_Oppo           0.0124      0.048      0.259      0.796      -0.081       0.106
brand_name_Others        -0.0080      0.042     -0.191      0.849      -0.091       0.074
brand_name_Panasonic      0.0562      0.056      1.006      0.314      -0.053       0.166
brand_name_Realme         0.0319      0.062      0.517      0.605      -0.089       0.153
brand_name_Samsung       -0.0314      0.043     -0.726      0.468      -0.116       0.053
brand_name_Sony          -0.0616      0.050     -1.221      0.222      -0.161       0.037
brand_name_Spice         -0.0148      0.063     -0.234      0.815      -0.139       0.109
brand_name_Vivo          -0.0155      0.048     -0.320      0.749      -0.111       0.080
brand_name_XOLO           0.0151      0.055      0.276      0.783      -0.092       0.123
brand_name_Xiaomi         0.0868      0.048      1.804      0.071      -0.008       0.181
brand_name_ZTE           -0.0058      0.047     -0.122      0.903      -0.099       0.087
os_Others                -0.0519      0.033     -1.585      0.113      -0.116       0.012
os_Windows               -0.0202      0.045     -0.448      0.654      -0.109       0.068
os_iOS                   -0.0669      0.146     -0.457      0.648      -0.354       0.220
4g_yes                    0.0530      0.016      3.341      0.001       0.022       0.084
5g_yes                   -0.0721      0.031     -2.292      0.022      -0.134      -0.010
==============================================================================
Omnibus:                      223.220   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.514
Skew:                          -0.618   Prob(JB):                     1.79e-92
Kurtosis:                       4.633   Cond. No.                     1.78e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Adjusted R-squared reflects the fit of the model. The value is 0.842, which is good.

const coefficient is the Y-intercept. The value is 1.3158.

Model Performance Check¶

def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100

def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  
    adjr2 = adj_r2_score(predictors, target, pred)  
    rmse = np.sqrt(mean_squared_error(target, pred))  
    mae = mean_absolute_error(target, pred) 
    mape = mape_score(target, pred)  

    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf

Training Performance

print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test) 
olsmodel_test_perf

Test Performance

The training R-squared is 0.844 and the model is not underfitting.

The train and test RMSE and MAE are comparable. The model is not overfitting.

MAE shows that model and predict used prices with mean error 0.18 on test data.

MAPE of 4.5 on test data means that we can predict 4.5% of prices.

Checking Linear Regression Assumptions¶

In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied.

# Test for Multicollinearity
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif

checking_vif(x_train)

There are multiple columns with high VIF so there is multicollinearity. Hight VIF columns need to be dropped.

# Removing multicollinearity
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)

    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    
    adj_r2 = []
    rmse = []

    
    for cols in high_vif_columns:
        
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]

 
        olsmodel = sm.OLS(target, train).fit()

        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))

    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)

    return temp

col_list = ["brand_name_apple", "os_iOS"] 
res = treating_multicollinearity(x_train, y_train, col_list) 
res

col_to_drop = "os_iOS" 
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)] 
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif

VIF after dropping  os_iOS

# Dropping high p values
predictors = x_train2.copy()  
cols = predictors.columns.tolist()

max_p_value = 1

while len(cols) > 0:
    x_train_aux = predictors[cols]
    model = sm.OLS(y_train, x_train_aux).fit()
    p_values = model.pvalues
    max_p_value = max(p_values)
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)

['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'battery', 'weight', 'normalized_new_price', 'years_since_release', 'brand_name_Lenovo', 'brand_name_Nokia', 'brand_name_Xiaomi', 'os_Others', '4g_yes']

x_train3 = x_train2[selected_features]  
x_test3 = x_test2[selected_features]

olsmodel1 = sm.OLS(y_train, x_train3).fit() 
print(olsmodel1.summary())

                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.842
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     988.3
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:56   Log-Likelihood:                 104.95
No. Observations:                   2417   AIC:                            -181.9
Df Residuals:                       2403   BIC:                            -100.8
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    1.3778      0.051     26.884      0.000       1.277       1.478
screen_size              0.0256      0.003      7.760      0.000       0.019       0.032
main_camera_mp           0.0212      0.001     15.318      0.000       0.018       0.024
selfie_camera_mp         0.0140      0.001     13.199      0.000       0.012       0.016
ram                      0.0177      0.004      4.012      0.000       0.009       0.026
battery              -1.505e-05    7.1e-06     -2.119      0.034    -2.9e-05   -1.12e-06
weight                   0.0009      0.000      7.183      0.000       0.001       0.001
normalized_new_price     0.4220      0.011     39.111      0.000       0.401       0.443
years_since_release     -0.0199      0.004     -5.511      0.000      -0.027      -0.013
brand_name_Lenovo        0.0491      0.021      2.287      0.022       0.007       0.091
brand_name_Nokia         0.0665      0.031      2.175      0.030       0.007       0.127
brand_name_Xiaomi        0.0892      0.026      3.496      0.000       0.039       0.139
os_Others               -0.0710      0.030     -2.378      0.017      -0.129      -0.012
4g_yes                   0.0501      0.015      3.370      0.001       0.021       0.079
==============================================================================
Omnibus:                      232.465   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              458.286
Skew:                          -0.626   Prob(JB):                    3.05e-100
Kurtosis:                       4.727   Cond. No.                     3.85e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Adjusted R squared value is same (0.842). Dropping columns seems to not have worked.

Now no feature has p-value greater than 0.05, so we'll consider the features in x_train2 as the final set of predictor variables and olsmod1 as the final model to move forward with

Adjusted R-squared is 0.842, Our model is able to explain ~84% of the variance.

RMSE and MAE values are comparable for train and test sets, indicating that the model is not overfitting

print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train3, y_train) 
olsmodel1_train_perf

Training Performance

print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test3, y_test) 
olsmodel1_test_perf

Test Performance

TEST FOR LINEARITY AND INDEPENDENCE

df_pred = pd.DataFrame()

df_pred["Actual Values"] = y_train 
df_pred["Fitted Values"] = olsmodel1.fittedvalues  
df_pred["Residuals"] = olsmodel1.resid  

df_pred.head()

sns.residplot(
    data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()

The scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values).

If there exist any pattern in this plot, we consider it as signs of non-linearity in the data and a pattern means that the model doesn't capture non-linear effects.

We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.

TEST FOR NORMALITY

sns.histplot(data=df_pred, x="Residuals", kde=True) 
plt.title("Normality of residuals")
plt.show()

The histogram of residuals have almost bell shape.

import pylab
import scipy.stats as stats

stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab) 
plt.show()

The residuals follow a straight line in the middle.

stats.shapiro(df_pred["Residuals"]) # Shapiro-Wilks Test

(0.9690815806388855, 2.105690889523832e-22)

Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test. The residuals are not normal. However, as an approximation, we can accept this distribution as close to being normal. So, the assumption is SATISFIED.

TEST FOR HOMOSCEDASTICITY

import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)

[('F statistic', 1.043112810782595), ('p-value', 0.23292738306791388)]

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.

Final Model¶

olsmodel_final = sm.OLS(y_train, x_train3).fit()
print(olsmodel_final.summary())

                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.842
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     988.3
Date:                   Sat, 28 May 2022   Prob (F-statistic):               0.00
Time:                           04:42:58   Log-Likelihood:                 104.95
No. Observations:                   2417   AIC:                            -181.9
Df Residuals:                       2403   BIC:                            -100.8
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    1.3778      0.051     26.884      0.000       1.277       1.478
screen_size              0.0256      0.003      7.760      0.000       0.019       0.032
main_camera_mp           0.0212      0.001     15.318      0.000       0.018       0.024
selfie_camera_mp         0.0140      0.001     13.199      0.000       0.012       0.016
ram                      0.0177      0.004      4.012      0.000       0.009       0.026
battery              -1.505e-05    7.1e-06     -2.119      0.034    -2.9e-05   -1.12e-06
weight                   0.0009      0.000      7.183      0.000       0.001       0.001
normalized_new_price     0.4220      0.011     39.111      0.000       0.401       0.443
years_since_release     -0.0199      0.004     -5.511      0.000      -0.027      -0.013
brand_name_Lenovo        0.0491      0.021      2.287      0.022       0.007       0.091
brand_name_Nokia         0.0665      0.031      2.175      0.030       0.007       0.127
brand_name_Xiaomi        0.0892      0.026      3.496      0.000       0.039       0.139
os_Others               -0.0710      0.030     -2.378      0.017      -0.129      -0.012
4g_yes                   0.0501      0.015      3.370      0.001       0.021       0.079
==============================================================================
Omnibus:                      232.465   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              458.286
Skew:                          -0.626   Prob(JB):                    3.05e-100
Kurtosis:                       4.727   Cond. No.                     3.85e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel_final, x_train3, y_train) 
olsmodel_final_train_perf

Training Performance

print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test3, y_test)

Test Performance

Actionable Insights and Recommendations¶

The model is able to explain ~84% of the variation in the data and within 4.5% of the prices on the test data, which is good

This indicates that the model is good for prediction as well as inference purposes.

If the screen size of the device increases by one unit, then its price increases by 0.0256 units, all other variables held constant.

If the mp for main camera increases by one unit, then its price increases by 0.0212 units, all other variables held constant.

If the mp for selfie camera increases by one unit, then its price increases by 0.0140 units, all other variables held constant.

If the battery size increases, price reduces.

As screen size contributes to the price, company needs to enhance marketing activities for screen size.

The company can also advertize main camera and selfie camera features which can help with more purchase of the devices.

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
brand_name	3454	34	Others	502	NaN	NaN	NaN	NaN	NaN	NaN	NaN
os	3454	4	Android	3214	NaN	NaN	NaN	NaN	NaN	NaN	NaN
screen_size	3454.0	NaN	NaN	NaN	13.713115	3.80528	5.08	12.7	12.83	15.34	30.71
4g	3454	2	yes	2335	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5g	3454	2	no	3302	NaN	NaN	NaN	NaN	NaN	NaN	NaN
main_camera_mp	3275.0	NaN	NaN	NaN	9.460208	4.815461	0.08	5.0	8.0	13.0	48.0
selfie_camera_mp	3452.0	NaN	NaN	NaN	6.554229	6.970372	0.0	2.0	5.0	8.0	32.0
int_memory	3450.0	NaN	NaN	NaN	54.573099	84.972371	0.01	16.0	32.0	64.0	1024.0
ram	3450.0	NaN	NaN	NaN	4.036122	1.365105	0.02	4.0	4.0	4.0	12.0
battery	3448.0	NaN	NaN	NaN	3133.402697	1299.682844	500.0	2100.0	3000.0	4000.0	9720.0
weight	3447.0	NaN	NaN	NaN	182.751871	88.413228	69.0	142.0	160.0	185.0	855.0
release_year	3454.0	NaN	NaN	NaN	2015.965258	2.298455	2013.0	2014.0	2015.5	2018.0	2020.0
days_used	3454.0	NaN	NaN	NaN	674.869716	248.580166	91.0	533.5	690.5	868.75	1094.0
normalized_used_price	3454.0	NaN	NaN	NaN	4.364712	0.588914	1.536867	4.033931	4.405133	4.7557	6.619433
normalized_new_price	3454.0	NaN	NaN	NaN	5.233107	0.683637	2.901422	4.790342	5.245892	5.673718	7.847841

	Actual Values	Fitted Values	Residuals
3026	4.087488	3.855295	0.232193
1525	4.448399	4.628844	-0.180445
1128	4.315353	4.280692	0.034660
3003	4.282068	4.206360	0.075709
2907	4.456438	4.470310	-0.013872

	brand_name	os	screen_size	4g	5g	main_camera_mp	selfie_camera_mp	int_memory	ram	battery	weight	release_year	days_used	normalized_used_price	normalized_new_price
0	Honor	Android	14.50	yes	no	13.0	5.0	64.0	3.0	3020.0	146.0	2020	127	4.307572	4.715100
1	Honor	Android	17.30	yes	yes	13.0	16.0	128.0	8.0	4300.0	213.0	2020	325	5.162097	5.519018
2	Honor	Android	16.69	yes	yes	13.0	8.0	128.0	8.0	4200.0	213.0	2020	162	5.111084	5.884631
3	Honor	Android	25.50	yes	yes	13.0	8.0	64.0	6.0	7250.0	480.0	2020	345	5.135387	5.630961
4	Honor	Android	15.32	yes	no	13.0	8.0	64.0	3.0	5000.0	185.0	2020	293	4.389995	4.947837

	feature	VIF
0	const	227.678565
1	screen_size	7.677118
2	main_camera_mp	2.284685
3	selfie_camera_mp	2.810716
4	int_memory	1.364057
5	ram	2.255246
6	battery	4.081715
7	weight	6.397183
8	days_used	2.660002
9	normalized_new_price	3.119569
10	years_since_release	4.898360
11	brand_name_Alcatel	3.405674
12	brand_name_Apple	13.054832
13	brand_name_Asus	3.332003
14	brand_name_BlackBerry	1.632250
15	brand_name_Celkon	1.774820
16	brand_name_Coolpad	1.467981
17	brand_name_Gionee	1.951247
18	brand_name_Google	1.321771
19	brand_name_HTC	3.410252
20	brand_name_Honor	3.340621
21	brand_name_Huawei	5.983857
22	brand_name_Infinix	1.283814
23	brand_name_Karbonn	1.573683
24	brand_name_LG	4.849589
25	brand_name_Lava	1.711317
26	brand_name_Lenovo	4.558847
27	brand_name_Meizu	2.179607
28	brand_name_Micromax	3.363518
29	brand_name_Microsoft	1.869558
30	brand_name_Motorola	3.274455
31	brand_name_Nokia	3.473140
32	brand_name_OnePlus	1.437047
33	brand_name_Oppo	3.971065
34	brand_name_Others	9.710921
35	brand_name_Panasonic	2.105711
36	brand_name_Realme	1.946675
37	brand_name_Samsung	7.539832
38	brand_name_Sony	2.943127
39	brand_name_Spice	1.688868
40	brand_name_Vivo	3.651320
41	brand_name_XOLO	2.138074
42	brand_name_Xiaomi	3.719678
43	brand_name_ZTE	3.797527
44	os_Others	1.854134
45	os_Windows	1.595291
46	os_iOS	11.780766
47	4g_yes	2.468374
48	5g_yes	1.811042

	feature	VIF
0	const	226.478213
1	screen_size	7.608412
2	main_camera_mp	2.282023
3	selfie_camera_mp	2.800555
4	int_memory	1.364057
5	ram	2.240599
6	battery	4.077782
7	weight	6.378839
8	days_used	2.659713
9	normalized_new_price	3.119556
10	years_since_release	4.896950
11	brand_name_Alcatel	3.405652
12	brand_name_Apple	2.114750
13	brand_name_Asus	3.332000
14	brand_name_BlackBerry	1.629346
15	brand_name_Celkon	1.774814
16	brand_name_Coolpad	1.467903
17	brand_name_Gionee	1.951078
18	brand_name_Google	1.321728
19	brand_name_HTC	3.409740
20	brand_name_Honor	3.340510
21	brand_name_Huawei	5.983730
22	brand_name_Infinix	1.283812
23	brand_name_Karbonn	1.573272
24	brand_name_LG	4.849233
25	brand_name_Lava	1.710944
26	brand_name_Lenovo	4.558824
27	brand_name_Meizu	2.179601
28	brand_name_Micromax	3.363324
29	brand_name_Microsoft	1.869378
30	brand_name_Motorola	3.274439
31	brand_name_Nokia	3.469823
32	brand_name_OnePlus	1.436988
33	brand_name_Oppo	3.970869
34	brand_name_Others	9.710490
35	brand_name_Panasonic	2.105587
36	brand_name_Realme	1.946598
37	brand_name_Samsung	7.539660
38	brand_name_Sony	2.943075
39	brand_name_Spice	1.688091
40	brand_name_Vivo	3.651290
41	brand_name_XOLO	2.137802
42	brand_name_Xiaomi	3.719633
43	brand_name_ZTE	3.797137
44	os_Others	1.742135
45	os_Windows	1.593582
46	4g_yes	2.467353
47	5g_yes	1.805566