Project Business Statistics: E-News Express

Marks: 60

Problem Statement

Business Context

The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, such as faster access to the content and the ability to utilize different technologies such as audio, graphics, video, and other interactive elements that are either not being used or aren’t common yet in traditional newspapers.

E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current webpage is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.

[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]

Objective

The design team of the company has researched and created a new landing page that has a new outline & more relevant content compared to the old page. To test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. As a data scientist at E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:

  1. Do the users spend more time on the new landing page than on the existing landing page?

  2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?

  3. Does the converted status depend on the preferred language? [Hint: Create a contingency table using the pandas.crosstab() function]

  4. Is the time spent on the new page the same for the different language users?
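The random, equal split described in the objective can be sketched as follows. This is a minimal illustration only: the function name `split_into_groups`, the seed, and the user IDs are hypothetical and are not taken from the actual experiment.

```python
import random

def split_into_groups(user_ids, seed=None):
    """Shuffle the users and cut the list in half: (control, treatment)."""
    rng = random.Random(seed)      # seeded generator so the split is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical IDs standing in for the 100 sampled users
user_ids = range(1, 101)
control, treatment = split_into_groups(user_ids, seed=42)
print(len(control), len(treatment))
```

Shuffling before splitting means that every user has the same chance of landing in either group, which is what lets the later tests attribute differences to the page rather than to how users were assigned.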

Data Dictionary

The data contains information regarding the interaction of users in both groups with the two versions of the landing page.

  1. user_id - Unique user ID of the person visiting the website

  2. group - Whether the user belongs to the first group (control) or the second group (treatment)

  3. landing_page - Whether the landing page is new or old

  4. time_spent_on_the_page - Time (in minutes) spent by the user on the landing page

  5. converted - Whether the user gets converted to a subscriber of the news portal or not

  6. language_preferred - Language chosen by the user to view the landing page

Import the necessary libraries

In [ ]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

# Library to help with statistical analysis
import scipy.stats as stats 

Loading the dataset

In [ ]:
df = pd.read_csv('abtest.csv')

Data Overview

The initial steps to get an overview of any dataset are to:

  • observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
  • get information about the number of rows and columns in the dataset
  • find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
  • check the statistical summary of the dataset to get an overview of the numerical columns of the data

Displaying the first few rows of the dataset

In [ ]:
df.head()
Out[ ]:
user_id group landing_page time_spent_on_the_page converted language_preferred
0 546592 control old 3.48 no Spanish
1 546468 treatment new 7.13 yes English
2 546462 treatment new 4.40 no Spanish
3 546567 control old 3.02 no French
4 546459 treatment new 4.75 yes Spanish

Displaying the last few rows of the dataset

In [ ]:
df.tail()
Out[ ]:
user_id group landing_page time_spent_on_the_page converted language_preferred
95 546446 treatment new 5.15 no Spanish
96 546544 control old 6.52 yes English
97 546472 treatment new 7.07 yes Spanish
98 546481 treatment new 6.20 yes Spanish
99 546483 treatment new 5.86 yes English
  • The dataset has been loaded successfully
  • It can be observed that, apart from user_id and time_spent_on_the_page, all the variables are categorical in nature

Checking the shape of the dataset

In [ ]:
df.shape
Out[ ]:
(100, 6)
  • The dataset has 100 rows and 6 columns.

Checking the data types of the columns for the dataset

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 100 non-null    int64  
 1   group                   100 non-null    object 
 2   landing_page            100 non-null    object 
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object 
 5   language_preferred      100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB
  • There are a total of 100 non-null observations in each of the columns

  • There are 6 columns named 'user_id', 'group', 'landing_page', 'time_spent_on_the_page', 'converted', 'language_preferred' whose data types are int64, object, object, float64, object, object respectively

  • 'group', 'landing_page', 'converted', and 'language_preferred' are objects

Getting the statistical summary for the numerical variables

In [ ]:
df.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
user_id 100.0 546517.0000 52.295779 546443.00 546467.75 546492.500 546567.2500 546592.00
time_spent_on_the_page 100.0 5.3778 2.378166 0.19 3.88 5.415 7.0225 10.71
  • The user_id column contains the user ID of the person visiting the page.
  • The maximum time spent on the landing page is less than 11 minutes.
  • The mean time spent on the landing page is approximately 5.4 minutes.
  • The median time spent on the landing page is approximately 5.4 minutes.

Getting the statistical summary for the categorical variables

In [ ]:
df.describe(include = ['object']).T
Out[ ]:
count unique top freq
group 100 2 treatment 50
landing_page 100 2 new 50
converted 100 2 yes 54
language_preferred 100 3 French 34
  • Variables group, landing_page and converted have only two categories
  • It can be observed that the distribution of users in group and landing_page are equal
  • More than 50% of users are converted
  • There are three categories of languages preferred by users. The most frequent category reported here is French, with 34 users.

Check for missing values

In [ ]:
# missing value check
df.isna().sum()
Out[ ]:
user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64
  • There are no missing values in any of the columns

Check for duplicate values

In [ ]:
df.duplicated().sum()
Out[ ]:
0
  • There are no duplicated entries in the dataset

Exploratory Data Analysis

Univariate Analysis

Let us first explore the numerical variables

Time spent on the page

In [ ]:
sns.histplot(data=df,x='time_spent_on_the_page')
plt.show()
sns.boxplot(data=df,x='time_spent_on_the_page')
plt.show()
  • The time spent on the page seems to have a fairly normal distribution.
  • There are no outliers in this column.

Let us now explore the categorical variables

Group

In [ ]:
df['group'].value_counts()
Out[ ]:
treatment    50
control      50
Name: group, dtype: int64
In [ ]:
sns.countplot(data=df,x='group')
plt.show()
  • The number of users is the same in the two groups (50 each)

Landing page

In [ ]:
df['landing_page'].value_counts()
Out[ ]:
new    50
old    50
Name: landing_page, dtype: int64
In [ ]:
sns.countplot(data=df,x='landing_page')
plt.show()
  • The number of users is the same for the two landing pages (50 each).

Converted

In [ ]:
df['converted'].value_counts()
Out[ ]:
yes    54
no     46
Name: converted, dtype: int64
In [ ]:
sns.countplot(data=df,x='converted')
plt.show()
  • Overall, 54 users get converted and 46 users do not get converted after visiting the landing page

Preferred language

In [ ]:
df['language_preferred'].value_counts()
Out[ ]:
French     34
Spanish    34
English    32
Name: language_preferred, dtype: int64
In [ ]:
sns.countplot(data=df,x='language_preferred')
plt.show()
  • The distribution of the number of users across the preferred languages is fairly uniform
  • There are 3 unique preferred languages - English, French, and Spanish.
    French and Spanish are tied as the most preferred languages (34 users each), followed by English (32 users)

Bivariate Analysis

Landing page vs Time spent on the page

In [ ]:
plt.figure(figsize=(10,6))
sns.boxplot(data=df,x='landing_page',y='time_spent_on_the_page')
plt.show()
  • Overall, the time spent on the new page seems to be greater than the time spent on the old page.

Conversion status vs Time spent on the page

In [ ]:
plt.figure(figsize=(10,6))
sns.boxplot(data=df,x='converted',y='time_spent_on_the_page')
plt.show()
  • Overall, the users who get converted seem to spend more time on the page.

Preferred language vs Time spent on the page

In [ ]:
plt.figure(figsize=(8,8))
sns.boxplot(x = 'language_preferred', y = 'time_spent_on_the_page', showmeans = True, data = df)
plt.show()
  • The median time spent on the page is approximately equal across the different language users.
  • The mean time spent on the page is approximately equal across the different language users.

1. Do the users spend more time on the new landing page than the existing landing page?

Perform Visual Analysis

In [ ]:
# visual analysis of the time spent by users on the new and old landing pages
plt.figure(figsize=(8,6))
sns.boxplot(x = 'landing_page', y = 'time_spent_on_the_page', data = df)
plt.show()
  • We see that people spent more time on the new page than the old landing page

Let's perform a hypothesis test to see if there is enough statistical evidence to support our observation.

Step 1: Define the null and alternative hypotheses

$H_0:$ The mean time spent by the users on the new page is equal to the mean time spent by the users on the old page.
$H_a:$ The mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.

Let $\mu_1$ and $\mu_2$ be the mean time spent by the users on the new and old page respectively. Then the above formulated hypotheses can be mathematically written as:

$H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 > \mu_2$

Step 2: Select Appropriate test

This is a one-tailed test concerning two population means from two independent populations. As the population standard deviations are unknown, the two sample independent t-test will be the appropriate test for this problem.

Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

In [ ]:
time_spent_new = df[df['landing_page'] == 'new']['time_spent_on_the_page']
time_spent_old = df[df['landing_page'] == 'old']['time_spent_on_the_page']
In [ ]:
print('The sample standard deviation of the time spent on the new page is:', round(time_spent_new.std(),2))
print('The sample standard deviation of the time spent on the old page is:', round(time_spent_old.std(),2))
The sample standard deviation of the time spent on the new page is: 1.82
The sample standard deviation of the time spent on the old page is: 2.58

As the sample standard deviations differ noticeably (1.82 vs. 2.58), we do not assume equal population standard deviations and use the unequal-variance (Welch's) form of the two-sample t-test.
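To make this choice concrete, the sketch below computes the two quantities an unequal-variance t-test relies on: the standard error of the difference in means and the Welch-Satterthwaite degrees of freedom. It plugs in the rounded standard deviations printed above and the group size of 50 users each, so the result is approximate. The helper name `welch_se_and_dof` is illustrative and not part of the notebook.

```python
import math

def welch_se_and_dof(s1, n1, s2, n2):
    """Standard error of (mean1 - mean2) and Welch-Satterthwaite degrees of freedom."""
    v1 = s1 ** 2 / n1   # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2   # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, dof

# Rounded sample standard deviations from the output above; 50 users per page
se, dof = welch_se_and_dof(1.82, 50, 2.58, 50)
print(f"standard error = {se:.3f}, degrees of freedom = {dof:.1f}")
```

The degrees of freedom come out below the pooled-variance value of 98 (n1 + n2 - 2), which makes the test slightly more conservative than one that assumes equal variances.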

Step 5: Calculate the p-value

In [ ]:
# import the required function
from scipy.stats import ttest_ind
# find the p-value
test_stat, p_value = ttest_ind(time_spent_new, time_spent_old, equal_var = False, alternative = 'greater')
print('The p-value is', p_value)
The p-value is 0.0001392381225166549

Step 6: Compare the p-value with $\alpha$

In [ ]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')
As the p-value 0.0001392381225166549 is less than the level of significance, we reject the null hypothesis.

Step 7: Draw inference

Since the p-value is less than the 5% significance level, we reject the null hypothesis. Hence, we have enough statistical evidence to say that the mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.

2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?

Perform Visual Analysis

In [ ]:
# visual analysis of the conversion rate for the new page and the conversion rate for the old page
pd.crosstab(df['landing_page'],df['converted'],normalize='index').plot(kind="bar", figsize=(6,8),stacked=True)
plt.legend()
plt.show()

From the above plot, more users appear to get converted on the new page than on the old page. Let's perform a hypothesis test to see if there is enough statistical evidence to say that the conversion rate for the new page is greater than the conversion rate for the old page.

Step 1: Define the null and alternative hypotheses

$H_0:$ The conversion rate for the new page is equal to the conversion rate for the old page.
$H_a:$ The conversion rate for the new page is greater than the conversion rate for the old page.

Let $p_1$ and $p_2$ be the conversion rate for the new and old page respectively.

Mathematically, the above formulated hypotheses can be written as:

$H_0: p_1 = p_2 \\ H_a: p_1 > p_2$

Step 2: Select Appropriate test

This is a one-tailed test concerning two population proportions from two independent populations. Hence, the two-sample proportion z-test will be the appropriate test for this problem.

Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

In [ ]:
new_converted = df[df['group'] == 'treatment']['converted'].value_counts()['yes']
old_converted = df[df['group'] == 'control']['converted'].value_counts()['yes']
print('The numbers of converted users for the new and old pages are {0} and {1} respectively'.format(new_converted, old_converted))
n_control = df['group'].value_counts()['control'] # number of users in the control group
n_treatment = df['group'].value_counts()['treatment'] # number of users in the treatment group
print('The numbers of users served the new and old pages are {0} and {1} respectively'.format(n_treatment, n_control))
The numbers of converted users for the new and old pages are 33 and 21 respectively
The numbers of users served the new and old pages are 50 and 50 respectively

Step 5: Calculate the p-value

In [ ]:
# import the required function
from statsmodels.stats.proportion import proportions_ztest

# find the p-value
test_stat, p_value = proportions_ztest([new_converted, old_converted] , [n_treatment, n_control], alternative = 'larger')
print('The p-value is', p_value)
The p-value is 0.008026308204056278

Step 6: Compare the p-value with $\alpha$

In [ ]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')
As the p-value 0.008026308204056278 is less than the level of significance, we reject the null hypothesis.

Step 7: Draw inference

Since the p-value is less than the 5% significance level, we reject the null hypothesis. Hence, we have enough statistical evidence to say that the conversion rate for the new page is greater than the conversion rate for the old page.
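As a cross-check, the p-value above can be reproduced by hand from the counts collected in Step 4 (33 of 50 converted on the new page, 21 of 50 on the old page). The sketch below assumes the pooled standard error under the null hypothesis and uses only the Python standard library. The function name `two_proportion_ztest_larger` is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_ztest_larger(x1, n1, x2, n2):
    """One-sided pooled z-test of H0: p1 = p2 against Ha: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)        # upper-tail probability
    return z, p_value

# Converted counts and group sizes reported in Step 4
z_stat, p_val = two_proportion_ztest_larger(33, 50, 21, 50)
print(f"z = {z_stat:.3f}, p-value = {p_val:.4f}")
```

The sample conversion rates are 66% for the new page and 42% for the old page. The hand computation agrees with the reported p-value of about 0.008.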

3. Does the converted status depend on the preferred language?

Perform Visual Analysis

In [ ]:
# visual analysis of the dependency between conversion status and preferred language
pd.crosstab(df['converted'],df['language_preferred'],normalize='index').plot(kind="bar", figsize=(6,8),
                 stacked=True)
plt.legend()
plt.show()

The shares of English and French language users appear to differ between converted and non-converted users. Let's perform a hypothesis test to check whether we have enough statistical evidence to say that the conversion status depends on the preferred language.

Step 1: Define the null and alternative hypotheses

$H_0:$ The converted status is independent of the preferred language.
$H_a:$ The converted status is not independent of the preferred language.

Step 2: Select Appropriate test

This is a test of independence concerning two categorical variables, converted status and preferred language. Hence, the Chi-square test of independence is the appropriate test for this problem.

Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

In [ ]:
# create the contingency table showing the distribution of two categorical variables
contingency_table = pd.crosstab(df['converted'], df['language_preferred'])
contingency_table
Out[ ]:
language_preferred English French Spanish
converted
no 11 19 16
yes 21 15 18

Step 5: Calculate the p-value

In [ ]:
#import the required function
from scipy.stats import chi2_contingency
# use chi2_contingency() to find the p-value
chi_2, p_value, dof, exp_freq = chi2_contingency(contingency_table)
# print the p-value
print('The p-value is', p_value)
The p-value is 0.21298887487543447

Step 6: Compare the p-value with $\alpha$

In [ ]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')
As the p-value 0.21298887487543447 is greater than the level of significance, we fail to reject the null hypothesis.

Step 7: Draw inference

Since the p-value is greater than the 5% significance level, we fail to reject the null hypothesis. Hence, we do not have enough statistical evidence to say that the converted status depends on the preferred language.
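The chi-square statistic behind this p-value can be reproduced by hand from the contingency table in Step 4. The sketch below uses only the standard library. It relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2), so it does not generalize to tables of other sizes.

```python
import math

# Observed counts from the contingency table in Step 4
observed = [
    [11, 19, 16],  # converted = no:  English, French, Spanish
    [21, 15, 18],  # converted = yes: English, French, Spanish
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # expected count if conversion and language were independent
        expected = row_totals[i] * col_totals[j] / total
        chi2_stat += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
if dof != 2:
    raise ValueError("the closed-form p-value below holds only for 2 degrees of freedom")
p_val = math.exp(-chi2_stat / 2)
print(f"chi-square = {chi2_stat:.3f}, dof = {dof}, p-value = {p_val:.4f}")
```

The expected counts are the products of the row and column totals divided by the grand total (for example, 46 × 32 / 100 = 14.72 for non-converted English users), and the resulting p-value agrees with the reported value of about 0.213.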

4. Is the time spent on the new page the same for the different language users?

Perform Visual Analysis

In [ ]:
# visual analysis of the mean time spent on the new page for different language users
plt.figure(figsize=(8,8))
# create a new DataFrame for users served the new page
df_new = df[df['landing_page'] == 'new']
sns.boxplot(x = 'language_preferred', y = 'time_spent_on_the_page', showmeans = True, data = df_new)
plt.show()
In [ ]:
# Checking the mean time spent on the new page for different language users
df_new.groupby(['language_preferred'])['time_spent_on_the_page'].mean()
Out[ ]:
language_preferred
English    6.663750
French     6.196471
Spanish    5.835294
Name: time_spent_on_the_page, dtype: float64

The mean time spent on the new page by English users is a bit higher than the mean time spent by French and Spanish users, but we need to test if this difference is statistically significant or not.

Step 1: Define the null and alternative hypotheses

$H_0:$ The mean times spent on the new page by English, French, and Spanish users are equal.
$H_a:$ At least one of the mean times spent on the new page by English, French, and Spanish users differs from the others.

Step 2: Select Appropriate test

This is a problem concerning three population means. One-way ANOVA is the appropriate test here, provided the normality and equality of variance assumptions are verified.

  • To test normality, the Shapiro-Wilk test is applied to the response variable.

  • To test equality of variances, Levene's test is applied to the response variable.

Shapiro-Wilk’s test

We will test the null hypothesis

$H_0:$ Time spent on the new page follows a normal distribution

against the alternative hypothesis

$H_a:$ Time spent on the new page does not follow a normal distribution

In [ ]:
# Assumption 1: Normality
# import the required function
from scipy.stats import shapiro

# find the p-value
w, p_value = shapiro(df_new['time_spent_on_the_page']) 
print('The p-value is', p_value)
The p-value is 0.8040016293525696

Since the p-value of the test is much larger than the 5% significance level, we fail to reject the null hypothesis that the response follows a normal distribution.

Levene’s test

We will test the null hypothesis

$H_0$: All the population variances are equal

against the alternative hypothesis

$H_a$: At least one variance is different from the rest

In [ ]:
# Assumption 2: Homogeneity of Variance
# import the required function
from scipy.stats import levene

# find the p-value
statistic, p_value = levene( df_new[df_new['language_preferred']=="English"]['time_spent_on_the_page'], 
                             df_new[df_new['language_preferred']=="French"]['time_spent_on_the_page'], 
                             df_new[df_new['language_preferred']=="Spanish"]['time_spent_on_the_page'])
print('The p-value is', p_value)
The p-value is 0.46711357711340173

Since the p-value is larger than the 5% significance level, we fail to reject the null hypothesis of homogeneity of variances.
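For intuition about what Levene's test measures, the sketch below computes the test statistic by hand: it is a one-way ANOVA F-statistic applied to each observation's absolute deviation from its group centre. The sketch centres on the group median, which is the default in `scipy.stats.levene`. The function name `levene_statistic` and the toy data are hypothetical and are not taken from abtest.csv.

```python
from statistics import mean, median

def levene_statistic(*groups):
    """Levene's W using absolute deviations from each group's median."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # absolute deviation of every observation from its own group median
    z = [[abs(x - median(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical toy data with increasing spread across the three groups
w_stat = levene_statistic([1, 2, 3], [2, 4, 6], [3, 6, 9])
print(f"W = {w_stat:.4f}")
```

The statistic is compared against an F distribution with (k - 1, N - k) degrees of freedom, where k is the number of groups and N the total number of observations. A large W, and hence a small p-value, would indicate unequal variances.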

Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

In [ ]:
time_spent_English = df_new[df_new['language_preferred']=="English"]['time_spent_on_the_page']
time_spent_French = df_new[df_new['language_preferred']=="French"]['time_spent_on_the_page']
time_spent_Spanish = df_new[df_new['language_preferred']=="Spanish"]['time_spent_on_the_page']

Step 5: Calculate the p-value

In [ ]:
# import the required function
from scipy.stats import f_oneway
# find the p-value
test_stat, p_value = f_oneway(time_spent_English, time_spent_French, time_spent_Spanish)
# print the p-value
print('The p-value is', p_value)
The p-value is 0.43204138694325955

Step 6: Compare the p-value with $\alpha$

In [ ]:
# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')
As the p-value 0.43204138694325955 is greater than the level of significance, we fail to reject the null hypothesis.

Step 7: Draw inference

Since the p-value is greater than the 5% significance level, we fail to reject the null hypothesis. Hence, we do not have enough statistical evidence to say that the mean times spent on the new page by English, French, and Spanish users differ.
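To show what the one-way ANOVA statistic measures, the sketch below computes the F-statistic by hand as the ratio of between-group to within-group mean squares. The function name `one_way_anova_f` and the toy data are hypothetical. The p-value step is omitted because it needs the F distribution, which the standard library does not provide.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F-statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical toy data, not taken from abtest.csv
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"F = {f_stat:.2f}")
```

An F-statistic close to 1 means the group means vary about as much as would be expected from the within-group noise alone, which is the situation a large p-value reflects.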

Conclusion and Recommendations

  • The users spend more time on the new page.

    • This indicates that the outline & recommended content of the new page are more likely to keep customers engaged long enough to make a decision to subscribe
  • The conversion rate for the new page is greater than the conversion rate of the old page

    • This indicates that the new page is more likely to gather new subscribers than the existing page
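  • There is not enough statistical evidence to conclude that the conversion status depends on the preferred language

  • There is not enough statistical evidence to conclude that the mean time spent on the new page differs across preferred languages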

    • This indicates that irrespective of the language, the outline & recommended content of the new page are engaging
  • It is recommended that the news company uses the new landing page to gather more subscribers

    • The business logic would be to design a page that people spend time on; conversion is then likely to follow