Project Business Statistics: E-News Express¶

Marks: 60

Problem Statement¶

Business Context¶

The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, including faster access to the content and the ability to use different technologies such as audio, graphics, video, and other interactive elements that are either not used or not yet common in traditional newspapers.

E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current webpage is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.

[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]

Objective¶

The design team of the company has researched and created a new landing page that has a new outline & more relevant content than the old page. To test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. As a data scientist at E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:

Do the users spend more time on the new landing page than on the existing landing page?
Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
Does the converted status depend on the preferred language? [Hint: Create a contingency table using the pandas.crosstab() function]
Is the time spent on the new page the same for the different language users?
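The random split into control and treatment groups described in the objective can be sketched in a few lines. This is only an illustration under assumptions: the user IDs, the seed, and the variable names are hypothetical, and the source does not say how the Data Science team actually performed the assignment.

```python
import numpy as np

# Hypothetical pool of 100 sampled user IDs (not the real IDs from abtest.csv)
user_ids = np.arange(1, 101)

# Shuffle with a fixed seed (an arbitrary choice) so the split is reproducible
rng = np.random.default_rng(seed=42)
shuffled = rng.permutation(user_ids)

# First half is served the old page (control), second half the new page (treatment)
control_ids, treatment_ids = shuffled[:50], shuffled[50:]

print(len(control_ids), len(treatment_ids))
```

Shuffling once and slicing keeps the two groups disjoint and equal in size, which is the property the experiment relies on.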

Data Dictionary¶

The data contains information regarding the interaction of users in both groups with the two versions of the landing page.

user_id - Unique user ID of the person visiting the website
group - Whether the user belongs to the first group (control) or the second group (treatment)
landing_page - Whether the landing page is new or old
time_spent_on_the_page - Time (in minutes) spent by the user on the landing page
converted - Whether the user gets converted to a subscriber of the news portal or not
language_preferred - Language chosen by the user to view the landing page

Import the necessary libraries¶

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

# Library to help with statistical analysis
import scipy.stats as stats

Loading the dataset¶

df = pd.read_csv('abtest.csv')

Data Overview¶

The initial steps to get an overview of any dataset are to:

observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
get information about the number of rows and columns in the dataset
find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
check the statistical summary of the dataset to get an overview of the numerical columns of the data

Displaying the first few rows of the dataset¶

df.head()

Displaying the last few rows of the dataset¶

df.tail()

The dataset has been loaded successfully
It can be observed that, except for user_id and time_spent_on_the_page, all the variables are categorical in nature

Checking the shape of the dataset¶

df.shape

(100, 6)

The dataset has 100 rows and 6 columns.

Checking the data types of the columns for the dataset¶

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 100 non-null    int64  
 1   group                   100 non-null    object 
 2   landing_page            100 non-null    object 
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object 
 5   language_preferred      100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB

There are a total of 100 non-null observations in each of the columns
There are 6 columns named 'user_id', 'group', 'landing_page', 'time_spent_on_the_page', 'converted', 'language_preferred' whose data types are int64, object, object, float64, object, object respectively

'group', 'landing_page', 'converted', and 'language_preferred' are objects

Getting the statistical summary for the numerical variables¶

df.describe().T

The user_id column contains the user ID of the person visiting the page.
The maximum time spent on the landing page is less than 11 minutes.
The mean time spent on the landing page is approximately 5.4 minutes.
The median time spent on the landing page is approximately 5.4 minutes.

Getting the statistical summary for the categorical variables¶

df.describe(include = ['object']).T

Variables group, landing_page and converted have only two categories
It can be observed that the distribution of users in group and landing_page are equal
More than 50% of users are converted
There are three categories of languages preferred by users; French and Spanish are tied as the most preferred, with 34 users each.

Check for missing values¶

# missing value check
df.isna().sum()

user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64

There are no missing values in any of the columns

Check for duplicate values¶

df.duplicated().sum()

0

There are no duplicated entries in the dataset

Exploratory Data Analysis¶

Univariate Analysis¶

Let us first explore the numerical variables

Time spent on the page¶

sns.histplot(data=df,x='time_spent_on_the_page')
plt.show()
sns.boxplot(data=df,x='time_spent_on_the_page')
plt.show()

The time spent on the page seems to have a fairly normal distribution.
There are no outliers in this column.
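The no-outlier reading can be checked against the 1.5 × IQR rule that box plots use by default for their whiskers. The sketch below plugs in the minimum, maximum, and quartiles that df.describe() reports for time_spent_on_the_page; the variable names are illustrative.

```python
# Summary statistics of time_spent_on_the_page as reported by df.describe()
q1, q3 = 3.88, 7.0225
t_min, t_max = 0.19, 10.71

# Box-plot fences: points beyond 1.5 * IQR from the quartiles are flagged as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# If even the extremes lie inside the fences, no observation is flagged
has_outliers = t_min < lower_fence or t_max > upper_fence
print(lower_fence, upper_fence, has_outliers)
```

Both the minimum and the maximum fall inside the fences, which agrees with the box plot showing no points beyond the whiskers.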

Let us now explore the categorical variables

Group¶

df['group'].value_counts()

treatment    50
control      50
Name: group, dtype: int64

sns.countplot(data=df,x='group')
plt.show()

The number of users is the same in the two groups (50 each)

Landing page¶

df['landing_page'].value_counts()

new    50
old    50
Name: landing_page, dtype: int64

sns.countplot(data=df,x='landing_page')
plt.show()

The number of users is the same for the two landing pages (50 each).

Converted¶

df['converted'].value_counts()

yes    54
no     46
Name: converted, dtype: int64

sns.countplot(data=df,x='converted')
plt.show()

Overall, 54 users get converted and 46 users do not get converted after visiting the landing page

Preferred language¶

df['language_preferred'].value_counts()

French     34
Spanish    34
English    32
Name: language_preferred, dtype: int64

sns.countplot(data=df,x='language_preferred')
plt.show()

The distribution of the number of users across various preferred languages is fairly uniform
There are 3 unique preferred languages - English, French, and Spanish.
French and Spanish are tied as the most preferred languages (34 users each), followed by English (32 users)

Bivariate Analysis¶

Landing page vs Time spent on the page¶

plt.figure(figsize=(10,6))
sns.boxplot(data=df,x='landing_page',y='time_spent_on_the_page')
plt.show()

Overall, the time spent on the new page seems to be greater than the time spent on the old page.

Conversion status vs Time spent on the page¶

plt.figure(figsize=(10,6))
sns.boxplot(data=df,x='converted',y='time_spent_on_the_page')
plt.show()

Overall, the users who get converted seem to spend more time on the page.

Preferred language vs Time spent on the page¶

plt.figure(figsize=(8,8))
sns.boxplot(x = 'language_preferred', y = 'time_spent_on_the_page', showmeans = True, data = df)
plt.show()

The median time spent is approximately equal across the different language users.
The mean time spent is approximately equal across the different language users.

1. Do the users spend more time on the new landing page than the existing landing page?¶

Perform Visual Analysis¶

# visual analysis of the time spent by users on the new and old landing pages
plt.figure(figsize=(8,6))
sns.boxplot(x = 'landing_page', y = 'time_spent_on_the_page', data = df)
plt.show()

We see that people spent more time on the new page than the old landing page

Let's perform a hypothesis test to see if there is enough statistical evidence to support our observation.

Step 1: Define the null and alternative hypotheses¶

$H_0:$ The mean time spent by the users on the new page is equal to the mean time spent by the users on the old page.
$H_a:$ The mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.

Let $\mu_1$ and $\mu_2$ be the mean time spent by the users on the new and old page respectively. Then the above formulated hypotheses can be mathematically written as:

$H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 > \mu_2$

Step 2: Select Appropriate test¶

This is a one-tailed test concerning two population means from two independent populations. As the population standard deviations are unknown, the two sample independent t-test will be the appropriate test for this problem.

Step 3: Decide the significance level¶

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data¶

time_spent_new = df[df['landing_page'] == 'new']['time_spent_on_the_page']
time_spent_old = df[df['landing_page'] == 'old']['time_spent_on_the_page']

print('The sample standard deviation of the time spent on the new page is:', round(time_spent_new.std(),2))
print('The sample standard deviation of the time spent on the old page is:', round(time_spent_old.std(),2))

The sample standard deviation of the time spent on the new page is: 1.82
The sample standard deviation of the time spent on the old page is: 2.58

As the sample standard deviations are different, the population standard deviations may be assumed to be different.
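When the population variances are not assumed equal, the two-sample t-test takes the Welch form, with the degrees of freedom given by the Welch-Satterthwaite approximation. The sketch below computes the statistic by hand to show what that involves. The helper name welch_t_greater is hypothetical, and the samples are synthetic stand-ins (the means are arbitrary, and only the spreads are loosely modeled on the values printed above), not the abtest.csv data.

```python
import numpy as np
from scipy import stats

def welch_t_greater(a, b):
    """One-sided Welch t-test of H_a: mean(a) > mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Variance of each sample mean, using the unbiased sample variance
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Upper-tail probability gives the one-sided p-value
    return t_stat, dof, stats.t.sf(t_stat, dof)

# Synthetic samples (NOT the abtest.csv data); means and seed are arbitrary
rng = np.random.default_rng(0)
sample_new = rng.normal(loc=6.2, scale=1.8, size=50)
sample_old = rng.normal(loc=4.5, scale=2.6, size=50)

t_stat, dof, p_value = welch_t_greater(sample_new, sample_old)
print(t_stat, dof, p_value)
```

On the same inputs, the result should agree with scipy.stats.ttest_ind called with equal_var = False and alternative = 'greater'.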

Step 5: Calculate the p-value¶

# import the required function
from scipy.stats import ttest_ind
# find the p-value
test_stat, p_value = ttest_ind(time_spent_new, time_spent_old, equal_var = False, alternative = 'greater')
print('The p-value is', p_value)

The p-value is 0.0001392381225166549

Step 6: Compare the p-value with $\alpha$¶

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

As the p-value 0.0001392381225166549 is less than the level of significance, we reject the null hypothesis.

Step 7: Draw inference¶

Since the p-value is less than the 5% significance level, we reject the null hypothesis. Hence, we have enough statistical evidence to say that the mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.
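The comparison in Step 6 is the same for every question in this analysis, so it could be wrapped in a small helper. The function name decision and its default alpha argument are illustrative and are not part of the original notebook.

```python
def decision(p_value, alpha=0.05):
    """Return the test conclusion for a given p-value and significance level."""
    if p_value < alpha:
        return f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.'
    return f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.'

# p-value obtained above for the time-spent comparison
print(decision(0.0001392381225166549))
```

Keeping alpha as a parameter makes the 5% level explicit at each call site instead of repeating the literal 0.05.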

2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?¶

Perform Visual Analysis¶

# visual analysis of the conversion rate for the new page and the conversion rate for the old page
pd.crosstab(df['converted'],df['landing_page'],normalize='index').plot(kind="bar", figsize=(6,8),stacked=True)
plt.legend()
plt.show()

From the above plot, the number of users who get converted appears to be higher for the new page than for the old page. Let's perform a hypothesis test to see if there is enough statistical evidence to say that the conversion rate for the new page is greater than the conversion rate for the old page.

Step 1: Define the null and alternative hypotheses¶

$H_0:$ The conversion rate for the new page is equal to the conversion rate for the old page.
$H_a:$ The conversion rate for the new page is greater than the conversion rate for the old page.

Let $p_1$ and $p_2$ be the conversion rate for the new and old page respectively.

Mathematically, the above formulated hypotheses can be written as:

$H_0: p_1 = p_2 \\ H_a: p_1 > p_2$

Step 2: Select Appropriate test¶

This is a one-tailed test concerning two population proportions from two independent populations. Hence, the two-sample proportion z-test will be the appropriate test for this problem.

Step 3: Decide the significance level¶

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data¶

new_converted = df[df['group'] == 'treatment']['converted'].value_counts()['yes']
old_converted = df[df['group'] == 'control']['converted'].value_counts()['yes']
print('The numbers of converted users for the new and old pages are {0} and {1} respectively'.format(new_converted, old_converted))
n_control = df.group.value_counts()['control'] # number of users in the control group
n_treatment = df.group.value_counts()['treatment'] #number of users in the treatment group
print('The numbers of users served the new and old pages are {0} and {1} respectively'.format(n_treatment, n_control))

The numbers of converted users for the new and old pages are 33 and 21 respectively
The numbers of users served the new and old pages are 50 and 50 respectively

Step 5: Calculate the p-value¶

# import the required function
from statsmodels.stats.proportion import proportions_ztest

# find the p-value
test_stat, p_value = proportions_ztest([new_converted, old_converted] , [n_treatment, n_control], alternative = 'larger')
print('The p-value is', p_value)

The p-value is 0.008026308204056278
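The p-value above can be reproduced by hand from the counts reported in Step 4, which makes the pooled two-proportion z-test explicit. This is a sketch using only scipy.stats.norm; the names p_new, p_old, p_pooled, std_err, z_stat, and p_value_manual are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Counts reported in Step 4
new_converted, old_converted = 33, 21
n_treatment, n_control = 50, 50

# Sample conversion rates for the new and old pages
p_new = new_converted / n_treatment
p_old = old_converted / n_control

# Pooled proportion and standard error under H_0: p_1 = p_2
p_pooled = (new_converted + old_converted) / (n_treatment + n_control)
std_err = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_treatment + 1 / n_control))

z_stat = (p_new - p_old) / std_err
# Upper-tail area, matching alternative = 'larger'
p_value_manual = norm.sf(z_stat)
print(p_new, p_old, z_stat, p_value_manual)
```

The sample conversion rates are 0.66 for the new page and 0.42 for the old page, and the upper-tail area of the resulting z statistic matches the p-value printed above.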

Step 6: Compare the p-value with $\alpha$¶

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

As the p-value 0.008026308204056278 is less than the level of significance, we reject the null hypothesis.

Step 7: Draw inference¶

Since the p-value is less than the 5% significance level, we reject the null hypothesis. Hence, we have enough statistical evidence to say that the conversion rate for the new page is greater than the conversion rate for the old page.

3. Are the conversion status and preferred language independent or related?¶

Perform Visual Analysis¶

# visual analysis of the dependency between conversion status and preferred language
pd.crosstab(df['converted'],df['language_preferred'],normalize='index').plot(kind="bar", figsize=(6,8),
                 stacked=True)
plt.legend()
plt.show()

The plot suggests that the conversion status is not evenly split for English and French users. Let's perform a hypothesis test to check whether there is enough statistical evidence to say that the conversion status depends on the preferred language.

Step 1: Define the null and alternative hypotheses¶

$H_0:$ The converted status is independent of the preferred language.
$H_a:$ The converted status is not independent of the preferred language.

Step 2: Select Appropriate test¶

The Chi-square test of independence is appropriate here, as the problem concerns the association between two categorical variables: converted status and preferred language.

Step 3: Decide the significance level¶

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data¶

# create the contingency table showing the distribution of two categorical variables
contingency_table = pd.crosstab(df['converted'], df['language_preferred'])
contingency_table

Step 5: Calculate the p-value¶

#import the required function
from scipy.stats import chi2_contingency
# use chi2_contingency() to find the p-value
chi_2, p_value, dof, exp_freq = chi2_contingency(contingency_table)
# print the p-value
print('The p-value is', p_value)

The p-value is 0.21298887487543447
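To show what chi2_contingency() computes, the sketch below derives the expected counts, the test statistic, and the p-value by hand. The table used here is hypothetical: its counts are made up for illustration and are NOT the notebook's contingency table, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2 x 3 table of counts (rows: two outcomes, columns: three categories).
# These numbers are invented for illustration and do not come from abtest.csv.
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if the row and column variables were independent
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value_manual = chi2.sf(chi2_stat, dof)
print(chi2_stat, dof, p_value_manual)
```

For a table with more than one degree of freedom, chi2_contingency() applies no continuity correction, so its result should match this hand computation.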

Step 6: Compare the p-value with $\alpha$¶

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

As the p-value 0.21298887487543447 is greater than the level of significance, we fail to reject the null hypothesis.

Step 7: Draw inference¶

Since the p-value is greater than the 5% significance level, we fail to reject the null hypothesis. Hence, we do not have enough statistical evidence to say that the converted status depends on the preferred language.

4. Is the time spent on the new page the same for the different language users?¶

Perform Visual Analysis¶

# visual analysis of the mean time spent on the new page for different language users
plt.figure(figsize=(8,8))
# create a new DataFrame for users served the new page
df_new = df[df['landing_page'] == 'new']
sns.boxplot(x = 'language_preferred', y = 'time_spent_on_the_page', showmeans = True, data = df_new)
plt.show()

# Checking the mean time spent on the new page for different language users
df_new.groupby(['language_preferred'])['time_spent_on_the_page'].mean()

language_preferred
English    6.663750
French     6.196471
Spanish    5.835294
Name: time_spent_on_the_page, dtype: float64

The mean time spent on the new page by English users is a bit higher than the mean time spent by French and Spanish users, but we need to test if this difference is statistically significant or not.

Step 1: Define the null and alternative hypotheses¶

$H_0:$ The mean times spent on the new page by English, French, and Spanish users are equal.
$H_a:$ At least one of the mean times spent on the new page by English, French, and Spanish users differs from the rest.

Step 2: Select Appropriate test¶

This problem concerns three population means. One-way ANOVA is an appropriate test here, provided that the normality and equality of variance assumptions hold.

To test for normality, Shapiro-Wilk’s test is applied to the response variable.
To test for equality of variances, Levene’s test is applied to the response variable.

Shapiro-Wilk’s test¶

We will test the null hypothesis

$H_0:$ Time spent on the new page follows a normal distribution

against the alternative hypothesis

$H_a:$ Time spent on the new page does not follow a normal distribution

# Assumption 1: Normality
# import the required function
from scipy.stats import shapiro

# find the p-value
w, p_value = shapiro(df_new['time_spent_on_the_page']) 
print('The p-value is', p_value)

The p-value is 0.8040016293525696

Since the p-value of the test is much larger than the 5% significance level, we fail to reject the null hypothesis that the response follows a normal distribution.

Levene’s test¶

We will test the null hypothesis

$H_0$: All the population variances are equal

against the alternative hypothesis

$H_a$: At least one variance is different from the rest

# Assumption 2: Homogeneity of Variance
# import the required function
from scipy.stats import levene
# find the p-value
statistic, p_value = levene( df_new[df_new['language_preferred']=="English"]['time_spent_on_the_page'], 
                             df_new[df_new['language_preferred']=="French"]['time_spent_on_the_page'], 
                             df_new[df_new['language_preferred']=="Spanish"]['time_spent_on_the_page'])
print('The p-value is', p_value)

The p-value is 0.46711357711340173

Since the p-value is larger than the 5% significance level, we fail to reject the null hypothesis of homogeneity of variances.
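One way to read Levene's test: with scipy's default center = 'median', its statistic is the one-way ANOVA F statistic computed on the absolute deviations of each observation from its own group median. The sketch below checks this on synthetic groups. The data, group sizes, and seed are arbitrary assumptions and are NOT taken from abtest.csv.

```python
import numpy as np
from scipy.stats import f_oneway, levene

# Synthetic groups (NOT the abtest.csv data); sizes and spreads are arbitrary
rng = np.random.default_rng(1)
groups = [rng.normal(loc=6.0, scale=s, size=n)
          for s, n in [(1.5, 20), (1.8, 25), (2.0, 30)]]

# Absolute deviation of each observation from its own group median
abs_dev = [np.abs(g - np.median(g)) for g in groups]

# One-way ANOVA on the deviations reproduces the median-centered Levene statistic
f_stat, p_manual = f_oneway(*abs_dev)
w_stat, p_levene = levene(*groups, center='median')
print(f_stat, w_stat, p_manual, p_levene)
```

A large p-value therefore means the typical distance from the group median is similar across groups, which is what the equal-variance assumption of ANOVA needs.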

Step 3: Decide the significance level¶

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data¶

time_spent_English = df_new[df_new['language_preferred']=="English"]['time_spent_on_the_page']
time_spent_French = df_new[df_new['language_preferred']=="French"]['time_spent_on_the_page']
time_spent_Spanish = df_new[df_new['language_preferred']=="Spanish"]['time_spent_on_the_page']

Step 5: Calculate the p-value¶

# import the required function
from scipy.stats import f_oneway
# find the p-value
test_stat, p_value = f_oneway(time_spent_English, time_spent_French, time_spent_Spanish)
# print the p-value
print('The p-value is', p_value)

The p-value is 0.43204138694325955
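The F statistic behind f_oneway() is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand. The helper name one_way_anova is hypothetical, and the three samples are synthetic with arbitrary means, spreads, and sizes; they are NOT the abtest.csv data.

```python
import numpy as np
from scipy import stats

def one_way_anova(*groups):
    """Return the F statistic and p-value of a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(g.size for g in groups)
    k = len(groups)
    grand_mean = np.concatenate(groups).mean()
    # Between-group and within-group sums of squares
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    # Ratio of mean squares, referred to an F(k - 1, n_total - k) distribution
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return f_stat, stats.f.sf(f_stat, k - 1, n_total - k)

# Synthetic samples (NOT the abtest.csv data)
rng = np.random.default_rng(2)
english = rng.normal(6.5, 1.8, size=20)
french = rng.normal(6.0, 1.8, size=20)
spanish = rng.normal(6.0, 1.8, size=20)

f_stat, p_value_manual = one_way_anova(english, french, spanish)
print(f_stat, p_value_manual)
```

On the same inputs, the result should agree with scipy.stats.f_oneway.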

Step 6: Compare the p-value with $\alpha$¶

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

As the p-value 0.43204138694325955 is greater than the level of significance, we fail to reject the null hypothesis.

Step 7: Draw inference¶

Since the p-value is greater than the 5% significance level, we fail to reject the null hypothesis. Hence, we do not have enough statistical evidence to say that the mean times spent on the new page by English, French, and Spanish users differ.

Conclusion and Recommendations¶

The users spend more time on the new page than on the old page.
- This suggests that the outline & recommended content of the new page are more likely to keep customers engaged long enough to make a decision to subscribe
The conversion rate for the new page is greater than the conversion rate for the old page
- This indicates that the new page is more likely to gather new subscribers than the existing page
There is not enough statistical evidence to conclude that the conversion status depends on the preferred language
There is not enough statistical evidence to conclude that the mean time spent on the new page differs across preferred languages
- This suggests that, irrespective of the language, the outline & recommended content of the new page are similarly engaging
It is recommended that the news company use the new landing page to gather more subscribers
- The business logic is to design a page that people spend time on; conversion is expected to follow

Output of df.describe().T (statistical summary of the numerical variables):

	count	mean	std	min	25%	50%	75%	max
user_id	100.0	546517.0000	52.295779	546443.00	546467.75	546492.500	546567.2500	546592.00
time_spent_on_the_page	100.0	5.3778	2.378166	0.19	3.88	5.415	7.0225	10.71

Output of df.head() (first few rows of the dataset):

	user_id	group	landing_page	time_spent_on_the_page	converted	language_preferred
0	546592	control	old	3.48	no	Spanish
1	546468	treatment	new	7.13	yes	English
2	546462	treatment	new	4.40	no	Spanish
3	546567	control	old	3.02	no	French
4	546459	treatment	new	4.75	yes	Spanish

Output of df.tail() (last few rows of the dataset):

	user_id	group	landing_page	time_spent_on_the_page	converted	language_preferred
95	546446	treatment	new	5.15	no	Spanish
96	546544	control	old	6.52	yes	English
97	546472	treatment	new	7.07	yes	Spanish
98	546481	treatment	new	6.20	yes	Spanish
99	546483	treatment	new	5.86	yes	English

Output of df.describe(include = ['object']).T (statistical summary of the categorical variables):

	count	unique	top	freq
group	100	2	treatment	50
landing_page	100	2	new	50
converted	100	2	yes	54
language_preferred	100	3	French	34