Businesses in the United States face high demand for human resources, and one of their constant challenges is identifying and attracting the right talent, perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. These immigration programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.
The growing number of applicants each year calls for a machine-learning-based solution that can help shortlist the candidates with higher chances of visa approval. OFLC has hired your firm EasyVisa for data-driven solutions. As a data scientist, you have to analyze the data provided and, with the help of a classification model, facilitate the visa approval process and recommend a suitable profile for the applicants for whom the visa should be certified or denied.
The data contains various attributes of the employee and the employer. The detailed data dictionary is given below.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
AdaBoostClassifier,
GradientBoostingClassifier,
StackingClassifier,
)
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
precision_score,
recall_score,
f1_score,
)
from sklearn.model_selection import GridSearchCV
## Importing from google drive
from google.colab import drive
drive.mount('/content/drive')
easyvisa = pd.read_csv('/content/drive/MyDrive/DSBA/Ensemble Techniques/EasyVisa Project/EasyVisa.csv')
data = easyvisa.copy()
data.head() ## First 5 rows of the data
data.tail() ## Last 5 rows of the data
data.shape ## Shape of data
data.info() ## Types of columns of dataset
There are 3 numeric (float and int type) and 9 string (object type) columns in the data.
data.duplicated().sum() ## Checking for duplicate values
There are no duplicate values in the dataset.
data.describe(include="all") ## Statistical summary of the data
Most of the employment is in the Northeast region.
Most of the employees have a Bachelor's degree.
Asia is the most common continent of origin for applicants.
The average prevailing wage is $74,456.
Most of the employees have job experience.
## There are negative values for number of employees.
data.loc[data["no_of_employees"] < 0].shape ## Checking negative values
data["no_of_employees"] = abs(data["no_of_employees"]) ## Converting the negative values to a positive number
cat_col = list(data.select_dtypes("object").columns)  ## list of categorical columns
for column in cat_col:
    print(data[column].value_counts())  ## count of each unique value in the column
    print("-" * 50)
data["case_id"].unique() ## check unique values in the case_id column
data.drop(columns=["case_id"], inplace=True)  ## Dropping 'case_id' column from the data
Univariate Analysis
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(data, "no_of_employees")
There are a lot of outliers in the data.
histogram_boxplot(data, 'prevailing_wage')
The average prevailing wage is around $70,000, and there are outliers in the data.
data.loc[data["prevailing_wage"] < 100] ## Observations for data with less than 100 prevailing wage
data.loc[data["prevailing_wage"] < 100, "unit_of_wage"].count()  ## Count of observations with prevailing wage below 100
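A wage below 100 is plausible for hourly pay but not for yearly pay, so a breakdown by wage unit (a small sketch along the same lines, not in the original) helps judge whether these rows are genuine or data errors.
data.loc[data["prevailing_wage"] < 100, "unit_of_wage"].value_counts()  ## wage-unit distribution among the sub-100 rows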
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate each bar with its label
    plt.show()
## Plotting count of continents
labeled_barplot(data, "continent", perc=True)
## Plotting count of Education of employees
labeled_barplot(data, "education_of_employee", perc=True)
## Plotting count of employees with job experience
labeled_barplot(data, "has_job_experience", perc=True)
## Plotting count of employees that require job training
labeled_barplot(data, "requires_job_training", perc=True)
## Plotting count of Regions of employment
labeled_barplot(data, "region_of_employment", perc=True)
## Plotting count of Unit of wages
labeled_barplot(data, "unit_of_wage", perc=True)
## Plotting count of cases Certified and Denied
labeled_barplot(data, "case_status", perc=True)
Bivariate Analysis
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(10, 5))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show() ## Correlation between variables
def distribution_plot_wrt_target(data, predictor, target):
    """Distributions and boxplots of a predictor split by the target classes."""
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot
    plt.show()
Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?
stacked_barplot(data, "education_of_employee", "case_status")
Yes, education does play a role in visa certification: applicants with only a high school education have a higher share of denied cases, while doctorate-degree applicants have a higher share of certified cases.
plt.figure(figsize=(10, 5))
sns.heatmap(pd.crosstab(data["education_of_employee"], data["region_of_employment"]),
annot=True,
fmt="g",
cmap="viridis"
)
plt.ylabel("Education")
plt.xlabel("Region")
plt.show() ## heatmap for the crosstab between education and region of employment
stacked_barplot(data, "region_of_employment", "case_status")
How does the visa status vary across different continents?
stacked_barplot(data, "continent", "case_status")
Europe has the highest percentage of certified cases, while South America has the highest percentage of denied cases.
Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?
stacked_barplot(data, "has_job_experience", "case_status")
Yes, work experience influences visa status: applicants with job experience have a higher share of certified cases.
stacked_barplot(data, "has_job_experience", "requires_job_training")
The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
distribution_plot_wrt_target(data, "prevailing_wage", "case_status")
Certified cases span a wider range of prevailing wages than denied cases.
plt.figure(figsize=(10, 5))
sns.boxplot(
    data=data, x="region_of_employment", y="prevailing_wage", showfliers=False, palette="PuBu"
)  ## Boxplot of prevailing wage across regions of employment
plt.show()
In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?
stacked_barplot(data, "unit_of_wage", "case_status")
Employees paid on a yearly basis are the most likely to be certified for a visa.
Outlier Check
## Outlier detection
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
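To complement the visual check, a numeric count based on the same 1.5 * IQR whisker rule (a minimal sketch, not in the original notebook) quantifies how many observations fall outside the whiskers for each numeric column.
for col in numeric_columns:
    q1, q3 = data[col].quantile([0.25, 0.75])  ## first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  ## whisker bounds
    n_out = ((data[col] < lower) | (data[col] > upper)).sum()
    print(f"{col}: {n_out} observations outside [{lower:.2f}, {upper:.2f}]")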
Data Preparation for modeling
## Encoding the target: Certified as 1, Denied as 0
data["case_status"] = data["case_status"].apply(lambda x: 1 if x == "Certified" else 0)
X = data.drop(["case_status"], axis=1)
Y = data["case_status"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=1, stratify=Y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Model Evaluation
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # Accuracy
    recall = recall_score(target, pred)  # Recall
    precision = precision_score(target, pred)  # Precision
    f1 = f1_score(target, pred)  # F1-score
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Decision Tree Model
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,y_train)
confusion_matrix_sklearn(model,X_train,y_train)
decision_tree_perf_train = model_performance_classification_sklearn(model,X_train,y_train) ## performance on train data
decision_tree_perf_train
confusion_matrix_sklearn(model,X_test,y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model,X_test,y_test) ## performance for test data
decision_tree_perf_test
Hyperparameter Tuning - Decision Tree
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)
parameters = {
"max_depth": np.arange(5, 16, 5),
"min_samples_leaf": [3, 5, 7],
"max_leaf_nodes": [2, 5],
"min_impurity_decrease": [0.0001, 0.001],
}
scorer = metrics.make_scorer(metrics.f1_score)
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
dtree_estimator = grid_obj.best_estimator_
dtree_estimator.fit(X_train, y_train)
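Before judging the tuned model, it can help to see which combination the grid search picked; a one-line sketch:
print(grid_obj.best_params_)  ## hyperparameters selected by GridSearchCV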
confusion_matrix_sklearn(dtree_estimator,X_train,y_train)
dtree_estimator_model_train_perf = model_performance_classification_sklearn(dtree_estimator,X_train,y_train) ## performance for train data on tuned estimator
dtree_estimator_model_train_perf
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)
dtree_estimator_model_test_perf = model_performance_classification_sklearn(dtree_estimator,X_test,y_test) ## performance for test data on tuned estimator
dtree_estimator_model_test_perf
The model overfits the training data, but after tuning, recall is better than accuracy.
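One way to make the overfitting gap concrete is to place the train and test metrics side by side (a sketch reusing the performance frames computed above):
pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_perf_test.T,
        dtree_estimator_model_train_perf.T,
        dtree_estimator_model_test_perf.T,
    ],
    axis=1,
    keys=["DT train", "DT test", "Tuned DT train", "Tuned DT test"],
)  ## train vs. test metrics for the untuned and tuned trees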
Bagging Classifier
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train,y_train)
confusion_matrix_sklearn(bagging_classifier,X_train,y_train)
bagging_classifier_model_train_perf = model_performance_classification_sklearn(bagging_classifier,X_train,y_train) ## performance on train data
bagging_classifier_model_train_perf
confusion_matrix_sklearn(bagging_classifier,X_test,y_test)
bagging_classifier_model_test_perf = model_performance_classification_sklearn(bagging_classifier,X_test,y_test) ## performance for test data
bagging_classifier_model_test_perf
Hyperparameter Tuning - Bagging Classifier
bagging_estimator_tuned = BaggingClassifier(random_state=1)
parameters = {
"max_samples": [0.7, 0.9],
"max_features": [0.7, 0.9],
"n_estimators": np.arange(90, 111, 10),
}
scorer = metrics.make_scorer(metrics.f1_score)  ## the grid search optimizes F1
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
bagging_estimator_tuned = grid_obj.best_estimator_
bagging_estimator_tuned.fit(X_train, y_train)
confusion_matrix_sklearn(bagging_estimator_tuned,X_train,y_train)
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(bagging_estimator_tuned,X_train,y_train) ## performance for train data on tuned estimator
bagging_estimator_tuned_model_train_perf
confusion_matrix_sklearn(bagging_estimator_tuned,X_test,y_test)
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(bagging_estimator_tuned,X_test,y_test) ## performance for test data on tuned estimator
bagging_estimator_tuned_model_test_perf
After tuning, the bagging classifier shows better accuracy and recall.
Random Forest
rf_estimator = RandomForestClassifier(class_weight={0:0.18,1:0.72},random_state=1)
rf_estimator.fit(X_train,y_train)
confusion_matrix_sklearn(rf_estimator,X_train,y_train)
rf_estimator_model_train_perf = model_performance_classification_sklearn(rf_estimator,X_train,y_train) ## performance on train data
rf_estimator_model_train_perf
confusion_matrix_sklearn(rf_estimator,X_test,y_test)
rf_estimator_model_test_perf = model_performance_classification_sklearn(rf_estimator,X_test,y_test) ## performance for test data
rf_estimator_model_test_perf
Hyperparameter Tuning - Random Forest
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)
parameters = {
"max_depth": list(np.arange(5, 15, 5)),
"max_features": ["sqrt", "log2"],
"min_samples_split": [5, 7],
"n_estimators": np.arange(15, 26, 5),
}
scorer = metrics.make_scorer(metrics.f1_score)  ## the grid search optimizes F1
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
rf_tuned = grid_obj.best_estimator_
rf_tuned.fit(X_train, y_train)
confusion_matrix_sklearn(rf_tuned,X_train,y_train)
rf_tuned_model_train_perf = model_performance_classification_sklearn(rf_tuned,X_train,y_train) ## performance for train data on tuned estimator
rf_tuned_model_train_perf
confusion_matrix_sklearn(rf_tuned,X_test,y_test)
rf_tuned_model_test_perf = model_performance_classification_sklearn(rf_tuned,X_test,y_test) ## performance for test data on tuned estimator
rf_tuned_model_test_perf
The random forest is overfitting the data; with hyperparameter tuning, recall improves.
Boosting - Model Building and Hyperparameter Tuning
AdaBoost Classifier
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
confusion_matrix_sklearn(ab_classifier,X_train,y_train)
ab_classifier_model_train_perf = model_performance_classification_sklearn(ab_classifier,X_train,y_train) ## performance on train data
ab_classifier_model_train_perf
confusion_matrix_sklearn(ab_classifier,X_test,y_test)
ab_classifier_model_test_perf = model_performance_classification_sklearn(ab_classifier,X_test,y_test) ## performance for test data
ab_classifier_model_test_perf
Hyperparameter Tuning - AdaBoost Classifier
abc_tuned = AdaBoostClassifier(random_state=1)
parameters = {
"base_estimator": [
DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=1),
DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=1),
],
"n_estimators": np.arange(80, 101, 10),
"learning_rate": np.arange(0.1, 0.4, 0.1),
}
scorer = metrics.make_scorer(metrics.f1_score)  ## the grid search optimizes F1
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train,y_train)
abc_tuned = grid_obj.best_estimator_
abc_tuned.fit(X_train, y_train)
confusion_matrix_sklearn(abc_tuned,X_train,y_train)
abc_tuned_model_train_perf = model_performance_classification_sklearn(abc_tuned,X_train,y_train) ## performance for train data on tuned estimator
abc_tuned_model_train_perf
confusion_matrix_sklearn(abc_tuned,X_test,y_test)
abc_tuned_model_test_perf = model_performance_classification_sklearn(abc_tuned,X_test,y_test) ## performance for test data on tuned estimator
abc_tuned_model_test_perf
Tuning the AdaBoost hyperparameters does not seem to help much; accuracy and recall already have good values.
Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
confusion_matrix_sklearn(gb_classifier,X_train,y_train)
gb_classifier_model_train_perf = model_performance_classification_sklearn(gb_classifier,X_train,y_train) ## performance on train data
gb_classifier_model_train_perf
confusion_matrix_sklearn(gb_classifier,X_test,y_test)
gb_classifier_model_test_perf = model_performance_classification_sklearn(gb_classifier,X_test,y_test) ## performance for test data
gb_classifier_model_test_perf
Hyperparameter Tuning - Gradient Boosting Classifier
gbc_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
parameters = {
"n_estimators": [200, 250],
"subsample": [0.9, 1],
"max_features": [0.8, 0.9],
"learning_rate": np.arange(0.1, 0.21, 0.1),
}
scorer = metrics.make_scorer(metrics.f1_score)  ## the grid search optimizes F1
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train,y_train)
gbc_tuned = grid_obj.best_estimator_
gbc_tuned.fit(X_train, y_train)
confusion_matrix_sklearn(gbc_tuned,X_train,y_train)
gbc_tuned_model_train_perf = model_performance_classification_sklearn(gbc_tuned,X_train,y_train) ## performance for train data on tuned estimator
gbc_tuned_model_train_perf
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)
gbc_tuned_model_test_perf = model_performance_classification_sklearn(gbc_tuned,X_test,y_test) ## performance for test data on tuned estimator
gbc_tuned_model_test_perf
The gradient boosting classifier seems to be a good model.
XGBoost Classifier
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)
confusion_matrix_sklearn(xgb_classifier,X_train,y_train)
xgb_classifier_model_train_perf = model_performance_classification_sklearn(xgb_classifier,X_train,y_train) ## performance on train data
xgb_classifier_model_train_perf
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)
xgb_classifier_model_test_perf = model_performance_classification_sklearn(xgb_classifier,X_test,y_test) ## performance for test data
xgb_classifier_model_test_perf
Hyperparameter Tuning - XGBoost Classifier
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
parameters = {
"n_estimators": np.arange(150, 250, 50),
"scale_pos_weight": [1, 2],
"subsample": [0.9, 1],
"learning_rate": np.arange(0.1, 0.21, 0.1),
"gamma": [3, 5],
"colsample_bytree": [0.8, 0.9],
"colsample_bylevel": [ 0.9, 1],
}
scorer = metrics.make_scorer(metrics.f1_score)  ## the grid search optimizes F1
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train,y_train)
xgb_tuned = grid_obj.best_estimator_
xgb_tuned.fit(X_train, y_train)
confusion_matrix_sklearn(xgb_tuned,X_train,y_train)
xgb_tuned_model_train_perf = model_performance_classification_sklearn(xgb_tuned,X_train,y_train) ## performance for train data on tuned estimator
xgb_tuned_model_train_perf
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)
xgb_tuned_model_test_perf = model_performance_classification_sklearn(xgb_tuned,X_test,y_test) ## performance for test data on tuned estimator
xgb_tuned_model_test_perf
Stacking Classifier
estimators = [
("AdaBoost", ab_classifier),
("Gradient Boosting", gbc_tuned),
("Random Forest", rf_tuned),
]
final_estimator = xgb_tuned
stacking_classifier = StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
confusion_matrix_sklearn(stacking_classifier,X_train,y_train)
stacking_classifier_model_train_perf = model_performance_classification_sklearn(stacking_classifier,X_train,y_train) ## performance on train data
stacking_classifier_model_train_perf
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
stacking_classifier_model_test_perf = model_performance_classification_sklearn(stacking_classifier,X_test,y_test) ## performance for test data
stacking_classifier_model_test_perf
Stacking classifier also seems to be a good model.
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
dtree_estimator_model_train_perf.T,
bagging_classifier_model_train_perf.T,
bagging_estimator_tuned_model_train_perf.T,
rf_estimator_model_train_perf.T,
rf_tuned_model_train_perf.T,
ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,
gb_classifier_model_train_perf.T,
gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,
stacking_classifier_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
dtree_estimator_model_test_perf.T,
bagging_classifier_model_test_perf.T,
bagging_estimator_tuned_model_test_perf.T,
rf_estimator_model_test_perf.T,
rf_tuned_model_test_perf.T,
ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,
gb_classifier_model_test_perf.T,
gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,
stacking_classifier_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Tuned Decision Tree",
"Bagging Classifier",
"Tuned Bagging Classifier",
"Random Forest",
"Tuned Random Forest",
"Adaboost Classifier",
"Tuned Adaboost Classifier",
"Gradient Boost Classifier",
"Tuned Gradient Boost Classifier",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
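Because missing a certifiable applicant is arguably the costlier error here, ranking the models by test recall (a small sketch, not in the original) makes the table easier to scan:
models_test_comp_df.T.sort_values(by="Recall", ascending=False)  ## models ranked by test recall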
The tuned AdaBoost classifier and the gradient boosting classifier seem to be good models for the data provided, because their accuracy, recall, precision, and F1 percentages are higher.
Important features of the final model
feature_names = X_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
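For a compact tabular view of the same drivers, the importances can be ranked directly (a small sketch):
pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10)  ## ten strongest features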
Using the tuned AdaBoost classifier model or the gradient boosting classifier model, case status (Certified or Denied) can be predicted more accurately.
The most suitable profile for applicants whose visa should be certified:
- Education: preferably a Doctorate degree
- Has job experience
- Higher prevailing wage
- Unit of wage: preferably yearly
The most suitable profile for applicants whose visa is likely to be denied:
- Education: the lowest level (high school)
- No job experience
- Lower prevailing wage
- Unit of wage: hourly