The stock market has consistently proven to be a good place to invest for the long term. There are several compelling reasons to invest in stocks: they help fight inflation, build wealth, and can offer tax benefits. Steady returns compounded over a long period can grow far more than intuition suggests, so the earlier one starts investing, the larger the corpus available at retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses when the market is down. However, it is easy to get lost in the sea of financial metrics used to value a single stock, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is a tedious task. Cluster analysis can identify groups of stocks that exhibit similar characteristics, as well as groups that are minimally correlated with one another. This helps investors analyze stocks across different market segments and guard against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/DSBA/Unsupervised Learning/Trade&Ahead/stock_data.csv')
print(f"There are {len(data.axes[0])} rows and {len(data.axes[1])} columns.")
data.sample(n=10, random_state=1)
data.info()
df = data.copy()
df.duplicated().sum()
There are no duplicate values in the data.
df.isnull().sum()
There are no missing values in the data.
df.describe(include='all').T
Univariate Analysis
def histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
df: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
sns.boxplot(
data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
)
    if bins:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
df[feature].mean(), color="green", linestyle="--"
)
ax_hist2.axvline(
df[feature].median(), color="black", linestyle="-"
)
Observations on Current Price -
histogram_boxplot(df, 'Current Price')
Observations on Price Change -
histogram_boxplot(df, 'Price Change')
Observations on Volatility -
histogram_boxplot(df, 'Volatility')
Observations on ROE -
histogram_boxplot(df, 'ROE')
Observations on Cash Ratio -
histogram_boxplot(df, 'Cash Ratio')
Observations on Net Cash Flow -
histogram_boxplot(df, 'Net Cash Flow')
Observations on Net Income -
histogram_boxplot(df, 'Net Income')
Observations on Earnings Per Share -
histogram_boxplot(df, 'Earnings Per Share')
Observations on Estimated Shares Outstanding -
histogram_boxplot(df, 'Estimated Shares Outstanding')
Observations on P/E Ratio -
histogram_boxplot(df, 'P/E Ratio')
Observations on P/B Ratio -
histogram_boxplot(df, 'P/B Ratio')
def labeled_barplot(df, feature, perc=False, n=None):
"""
Barplot with percentage at the top
df: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df[feature])
count = df[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df,
x=feature,
palette="Paired",
order=df[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
)
else:
label = p.get_height()
x = p.get_x() + p.get_width() / 2
y = p.get_height()
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
)
plt.show()
Observations on GICS Sector -
labeled_barplot(df, 'GICS Sector', perc=True)
Industrials and Financials seem to be the most dominant economic sectors.
labeled_barplot(df, 'GICS Sub Industry', perc=True)
Electric Utilities, Banks, and Biotechnology seem to be the most dominant GICS Sub Industries.
Bivariate Analysis
plt.figure(figsize=(15, 7))
sns.heatmap(
    df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Net Income and Estimated Shares Outstanding are more strongly correlated than the other attribute pairs, indicating that Net Income tends to go hand in hand with the number of shares outstanding.
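As a quick cross-check of the heatmap, we can rank attribute pairs by the absolute value of their correlation. A minimal sketch, reusing the df dataframe loaded above:
corr = df.corr(numeric_only=True)
# keep only the upper triangle so each pair appears once, then rank by absolute correlation
pairs = (
    corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    .stack()
    .abs()
    .sort_values(ascending=False)
)
print(pairs.head())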
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
plt.xticks(rotation=90)
plt.show()
The Health Care and Consumer Staples sectors seem to have the highest average price change.
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
Cash Ratio appears to be higher for the Information Technology and Telecommunication Services sectors.
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
P/E Ratio appears to be higher for the Energy and Real Estate sectors.
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
plt.xticks(rotation=90)
plt.show()
Volatility appears to be higher for the Energy and Materials sectors.
Outlier Check
plt.figure(figsize=(15, 12))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
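To quantify what the boxplots show, we can compute the share of values flagged by the standard 1.5 × IQR rule in each numeric column. A minimal sketch, reusing the numeric_columns list defined above:
Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1
# percentage of observations falling outside the 1.5*IQR whiskers, per column
outlier_pct = (
    ((df[numeric_columns] < Q1 - 1.5 * IQR) | (df[numeric_columns] > Q3 + 1.5 * IQR)).sum()
    / len(df)
    * 100
).round(2)
print(outlier_pct.sort_values(ascending=False))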
Scaling
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
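As a sanity check on the standardization, each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:
# mean should be ~0 and standard deviation ~1 for every scaled column
print(subset_scaled_df.describe().loc[["mean", "std"]].round(2).T)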
Checking Elbow Plot
k_means_df = subset_scaled_df.copy()
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(subset_scaled_df)
prediction = model.predict(k_means_df)
distortion = (
sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Checking Silhouette scores
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((subset_scaled_df))
score = silhouette_score(k_means_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()
Appropriate value for K seems to be 8.
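The choice of k can also be cross-checked programmatically by picking the candidate with the highest average silhouette score from the loop above; a small sketch reusing cluster_list and sil_score:
# candidate k with the highest average silhouette score computed above
best_k = cluster_list[int(np.argmax(sil_score))]
print(f"Best k by silhouette score: {best_k} (score = {max(sil_score):.3f})")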
visualizer = SilhouetteVisualizer(KMeans(8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Creating Final Model
kmeans = KMeans(n_clusters=8, random_state=1)
kmeans.fit(k_means_df)
df1 = df.copy()
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
Cluster Profiling
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only=True)
km_cluster_profile["count_in_each_segment"] = (
df1.groupby("KM_segments")["Security"].count().values
)
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
for cl in df1["KM_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df1[df1["KM_segments"] == cl]["Security"].unique())
print()
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
num_col = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df1, x="KM_segments", y=variable)
plt.tight_layout(pad=2.0)
We will look into clusters 0, 1, and 2 only because these clusters span more sectors than the others; their summary rows are pulled out below.
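The profile rows for just these segments can be pulled out directly from the profile table built above:
# summary statistics for the three segments discussed below
km_cluster_profile.loc[[0, 1, 2]]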
Cluster 0
Cluster 1
Cluster 2
Computing Cophenetic Correlation
hc_df = subset_scaled_df.copy()
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Euclidean distance
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
We see that the cophenetic correlation is maximum with Euclidean distance and average linkage.
Checking Dendrograms
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
for i, method in enumerate(linkage_methods):
Z = linkage(hc_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
compare.append([method, coph_corr])
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
The cophenetic correlation is highest for the average linkage method.
We will move ahead with average linkage.
6 appears to be the appropriate number of clusters from the dendrogram for average linkage.
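The dendrogram can also be cut directly with SciPy to obtain flat cluster labels for the chosen linkage; a minimal sketch, with the number of clusters passed as a parameter (6 here, per the observation above):
from scipy.cluster.hierarchy import fcluster
# cut the average-linkage dendrogram into a fixed number of flat clusters
Z_avg = linkage(hc_df, metric="euclidean", method="average")
labels_from_dendrogram = fcluster(Z_avg, t=6, criterion="maxclust")
print(pd.Series(labels_from_dendrogram).value_counts())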
Creating model using sklearn
HCmodel = AgglomerativeClustering(n_clusters=8, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
df2 = df.copy()
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
Cluster Profiling
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only=True)
hc_cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
for cl in df2["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["HC_segments"] == cl]["Security"].unique())
print()
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df2, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
We will look into clusters 0, 1, and 2 only because these clusters account for most of the companies.
Cluster 0
There are 3 companies in this cluster.
Price Change and P/E Ratio are high for these companies.
Net Cash Flow is moderate.
Cluster 1
There are 2 companies in this cluster.
Net Cash Flow and Net Income are high for these companies.
Cluster 2
There are 330 companies in this cluster.
P/B Ratio and P/E Ratio are low for these companies.
Net Cash Flow is low.
Finally, the two clustering techniques can be compared on several aspects, such as execution time, the number of clusters each suggests, and the differences or similarities in the cluster profiles obtained from the two approaches; a quick label-level comparison is sketched below.
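One simple way to compare the two segmentations is to cross-tabulate the labels and compute an agreement score; a sketch, assuming df1 and df2 still carry the KM_segments and HC_segments labels created above:
from sklearn.metrics import adjusted_rand_score
# how do the K-means and hierarchical segment assignments overlap?
print(pd.crosstab(df1["KM_segments"], df2["HC_segments"]))
# adjusted Rand index: 1.0 means identical partitions, values near 0 mean no better than chance
ari = adjusted_rand_score(df1["KM_segments"], df2["HC_segments"])
print(f"Adjusted Rand index between the two segmentations: {ari:.3f}")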