EasyVisa Project

Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally as well as abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired your firm EasyVisa for data-driven solutions. You as a data scientist have to analyze the data provided and, with the help of a classification model:

  • Facilitate the process of visa approvals.
  • Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

  • case_id: ID of each visa application
  • continent: Information of continent the employee
  • education_of_employee: Information of education of the employee
  • has_job_experience: Does the employee has any job experience? Y= Yes; N = No
  • requires_job_training: Does the employee require any job training? Y = Yes; N = No
  • no_of_employees: Number of employees in the employer's company
  • yr_of_estab: Year in which the employer's company was established
  • region_of_employment: Information of foreign worker's intended region of employment in the US.
  • prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
  • unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
  • full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
  • case_status: Flag indicating if the Visa was certified or denied

Importing necessary libraries and data

In [ ]:
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

from sklearn.model_selection import GridSearchCV
In [ ]:
## Importing from google drive

from google.colab import drive
drive.mount('/content/drive')
easyvisa = pd.read_csv('/content/drive/MyDrive/DSBA/Ensemble Techniques/EasyVisa Project/EasyVisa.csv')
Mounted at /content/drive
In [ ]:
data = easyvisa.copy()

Data Overview

  • Observations
  • Sanity checks
In [ ]:
data.head() ##  First 5 rows of the data
Out[ ]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
0 EZYV01 Asia High School N N 14513 2007 West 592.2029 Hour Y Denied
1 EZYV02 Asia Master's Y N 2412 2002 Northeast 83425.6500 Year Y Certified
2 EZYV03 Asia Bachelor's N Y 44444 2008 West 122996.8600 Year Y Denied
3 EZYV04 Asia Bachelor's N N 98 1897 West 83434.0300 Year Y Denied
4 EZYV05 Africa Master's Y N 1082 2005 South 149907.3900 Year Y Certified
In [ ]:
data.tail() ##  Last 5 rows of the data
Out[ ]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
25475 EZYV25476 Asia Bachelor's Y Y 2601 2008 South 77092.57 Year Y Certified
25476 EZYV25477 Asia High School Y N 3274 2006 Northeast 279174.79 Year Y Certified
25477 EZYV25478 Asia Master's Y N 1121 1910 South 146298.85 Year N Certified
25478 EZYV25479 Asia Master's Y Y 1918 1887 West 86154.77 Year Y Certified
25479 EZYV25480 Asia Bachelor's Y N 3195 1960 Midwest 70876.91 Year Y Certified
In [ ]:
data.shape ## Shape of data
Out[ ]:
(25480, 12)
In [ ]:
data.info() ## Types of columns of dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB

There are 3 numeric (float and int type) and 9 string (object type) columns in the data.

In [ ]:
data.duplicated().sum()  ## Checking for duplicate values
Out[ ]:
0

There are no duplicate values in the dataset.

Exploratory Data Analysis (EDA)

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.
In [ ]:
data.describe(include="all") ## Statistical summary of the data
Out[ ]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
count 25480 25480 25480 25480 25480 25480.000000 25480.000000 25480 25480.000000 25480 25480 25480
unique 25480 6 4 2 2 NaN NaN 5 NaN 4 2 2
top EZYV01 Asia Bachelor's Y N NaN NaN Northeast NaN Year Y Certified
freq 1 16861 10234 14802 22525 NaN NaN 7195 NaN 22962 22773 17018
mean NaN NaN NaN NaN NaN 5667.043210 1979.409929 NaN 74455.814592 NaN NaN NaN
std NaN NaN NaN NaN NaN 22877.928848 42.366929 NaN 52815.942327 NaN NaN NaN
min NaN NaN NaN NaN NaN -26.000000 1800.000000 NaN 2.136700 NaN NaN NaN
25% NaN NaN NaN NaN NaN 1022.000000 1976.000000 NaN 34015.480000 NaN NaN NaN
50% NaN NaN NaN NaN NaN 2109.000000 1997.000000 NaN 70308.210000 NaN NaN NaN
75% NaN NaN NaN NaN NaN 3504.000000 2005.000000 NaN 107735.512500 NaN NaN NaN
max NaN NaN NaN NaN NaN 602069.000000 2016.000000 NaN 319210.270000 NaN NaN NaN

Most of the employment is from Northeast region.

Most of the employees have Bachelor's degree.

Most popular continent which applicants are from is Asia.

Average prevailing wage is $74456.

Most of the employees have job experience.

In [ ]:
## There are negative values for number of employees.

data.loc[data["no_of_employees"] < 0].shape ## Checking negative values
Out[ ]:
(33, 12)
In [ ]:
data["no_of_employees"] = abs(data["no_of_employees"]) ## Converting the negative values to a positive number
In [ ]:
cat_col = list(data.select_dtypes("object").columns) ## count of each unique value in each column

for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)
EZYV01       1
EZYV16995    1
EZYV16993    1
EZYV16992    1
EZYV16991    1
            ..
EZYV8492     1
EZYV8491     1
EZYV8490     1
EZYV8489     1
EZYV25480    1
Name: case_id, Length: 25480, dtype: int64
--------------------------------------------------
Asia             16861
Europe            3732
North America     3292
South America      852
Africa             551
Oceania            192
Name: continent, dtype: int64
--------------------------------------------------
Bachelor's     10234
Master's        9634
High School     3420
Doctorate       2192
Name: education_of_employee, dtype: int64
--------------------------------------------------
Y    14802
N    10678
Name: has_job_experience, dtype: int64
--------------------------------------------------
N    22525
Y     2955
Name: requires_job_training, dtype: int64
--------------------------------------------------
Northeast    7195
South        7017
West         6586
Midwest      4307
Island        375
Name: region_of_employment, dtype: int64
--------------------------------------------------
Year     22962
Hour      2157
Week       272
Month       89
Name: unit_of_wage, dtype: int64
--------------------------------------------------
Y    22773
N     2707
Name: full_time_position, dtype: int64
--------------------------------------------------
Certified    17018
Denied        8462
Name: case_status, dtype: int64
--------------------------------------------------
In [ ]:
data["case_id"].unique() ## check unique values in the case_id column
Out[ ]:
array(['EZYV01', 'EZYV02', 'EZYV03', ..., 'EZYV25478', 'EZYV25479',
       'EZYV25480'], dtype=object)
In [ ]:
data.drop(columns=['case_id'], axis=1, inplace=True) ## Dropping 'case_id' column from the data

Univariate Analysis

In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
histogram_boxplot(data, "no_of_employees")

There are a lot of outiers in the data.

In [ ]:
histogram_boxplot(data, 'prevailing_wage') 

Average prevailing wage is around 70000 and there are outliers in the data.

In [ ]:
data.loc[data["prevailing_wage"] < 100] ## Observations for data with less than 100 prevailing wage
Out[ ]:
continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
338 Asia Bachelor's Y N 2114 2012 Northeast 15.7716 Hour Y Certified
634 Asia Master's N N 834 1977 Northeast 3.3188 Hour Y Denied
839 Asia High School Y N 4537 1999 West 61.1329 Hour Y Denied
876 South America Bachelor's Y N 731 2004 Northeast 82.0029 Hour Y Denied
995 Asia Master's N N 302 2000 South 47.4872 Hour Y Certified
... ... ... ... ... ... ... ... ... ... ... ...
25023 Asia Bachelor's N Y 3200 1994 South 94.1546 Hour Y Denied
25258 Asia Bachelor's Y N 3659 1997 South 79.1099 Hour Y Denied
25308 North America Master's N N 82953 1977 Northeast 42.7705 Hour Y Denied
25329 Africa Bachelor's N N 2172 1993 Northeast 32.9286 Hour Y Denied
25461 Asia Master's Y N 2861 2004 West 54.9196 Hour Y Denied

176 rows × 11 columns

In [ ]:
data.loc[data["prevailing_wage"] < 100, "unit_of_wage"].count ## Count of values in the column
Out[ ]:
<bound method Series.count of 338      Hour
634      Hour
839      Hour
876      Hour
995      Hour
         ... 
25023    Hour
25258    Hour
25308    Hour
25329    Hour
25461    Hour
Name: unit_of_wage, Length: 176, dtype: object>
In [ ]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()
In [ ]:
## Plotting count of countinents

labeled_barplot(data, "continent", perc=True) 
In [ ]:
## Plotting count of Education of employees

labeled_barplot(data, "education_of_employee", perc=True) 
In [ ]:
## Plotting count of employees with job experience

labeled_barplot(data, "has_job_experience", perc=True) 
In [ ]:
## Plotting count of employees that require job training

labeled_barplot(data, "requires_job_training", perc=True) 
In [ ]:
## Plotting count of Regions of employment

labeled_barplot(data, "region_of_employment", perc=True)
In [ ]:
## Plotting count of Unit of wages

labeled_barplot(data, "unit_of_wage", perc=True) 
In [ ]:
## Plotting count of cases Certified and Denied

labeled_barplot(data, "case_status", perc=True) 

Bivariate Analysis

In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(10, 5))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
) 
plt.show() ## Correlation between variables
In [ ]:
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

In [ ]:
stacked_barplot(data, "education_of_employee", "case_status")
case_status            Certified  Denied    All
education_of_employee                          
All                        17018    8462  25480
Bachelor's                  6367    3867  10234
High School                 1164    2256   3420
Master's                    7575    2059   9634
Doctorate                   1912     280   2192
------------------------------------------------------------------------------------------------------------------------

Yes, education does play a role in Visa certification. High School degree applicants seem to have more visa cases Denied and Doctorate degree applicants seem to have more Certified cases.

In [ ]:
plt.figure(figsize=(10, 5))
sns.heatmap(pd.crosstab(data["education_of_employee"], data["region_of_employment"]),
    annot=True,
    fmt="g",
    cmap="viridis"
) 

plt.ylabel("Education")
plt.xlabel("Region")
plt.show()  ## heatmap for the crosstab between education and region of employment
In [ ]:
stacked_barplot(data, "region_of_employment", "case_status")
case_status           Certified  Denied    All
region_of_employment                          
All                       17018    8462  25480
Northeast                  4526    2669   7195
West                       4100    2486   6586
South                      4913    2104   7017
Midwest                    3253    1054   4307
Island                      226     149    375
------------------------------------------------------------------------------------------------------------------------

How does the visa status vary across different continents?

In [ ]:
stacked_barplot(data, "continent", "case_status")
case_status    Certified  Denied    All
continent                              
All                17018    8462  25480
Asia               11012    5849  16861
North America       2037    1255   3292
Europe              2957     775   3732
South America        493     359    852
Africa               397     154    551
Oceania              122      70    192
------------------------------------------------------------------------------------------------------------------------

Europe seems to have more percentage of Cases certified and South America has more percentage of cases Denial.

Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

In [ ]:
stacked_barplot(data, "has_job_experience", "case_status")
case_status         Certified  Denied    All
has_job_experience                          
All                     17018    8462  25480
N                        5994    4684  10678
Y                       11024    3778  14802
------------------------------------------------------------------------------------------------------------------------

Yes, work experience influences visa status. Applicants with job experience have more cases Certified.

In [ ]:
stacked_barplot(data, "has_job_experience", "requires_job_training")
requires_job_training      N     Y    All
has_job_experience                       
All                    22525  2955  25480
N                       8988  1690  10678
Y                      13537  1265  14802
------------------------------------------------------------------------------------------------------------------------

The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

In [ ]:
distribution_plot_wrt_target(data, "prevailing_wage", "case_status")

Certified cases have wider range of Prevailing wages than Denied cases.

In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data['region_of_employment'], data['prevailing_wage'], showfliers=False,palette='PuBu') ## Complete the code to create boxplot for region of employment and prevailing wage
plt.show()

In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

In [ ]:
stacked_barplot(data, "unit_of_wage", "case_status")
case_status   Certified  Denied    All
unit_of_wage                          
All               17018    8462  25480
Year              16047    6915  22962
Hour                747    1410   2157
Week                169     103    272
Month                55      34     89
------------------------------------------------------------------------------------------------------------------------

Employees with Yearly paid unit of wage is most likely to be certified for a visa.

Data Preprocessing

Outlier Check

In [ ]:
## Outlier detection 
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable) 
plt.show()

Data Preparation for modeling

In [ ]:
## Dropping case_status column

data["case_status"] = data["case_status"].apply(lambda x: 1 if x == "Certified" else 0)

X = data.drop(["case_status"], axis=1)
Y = data["case_status"]

X = pd.get_dummies(X, drop_first=True) 

# Splitting data in train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state=1, stratify=Y) 
In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (17836, 21)
Shape of test set :  (7644, 21)
Percentage of classes in training set:
1    0.667919
0    0.332081
Name: case_status, dtype: float64
Percentage of classes in test set:
1    0.667844
0    0.332156
Name: case_status, dtype: float64

Model Evaluation

In [ ]:
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # Accuracy
    recall = recall_score(target, pred)  # Recall
    precision = precision_score(target, pred)  # Precision
    f1 = f1_score(target, pred)  # F1-score

    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Decision Tree - Model Building and Hyperparameter Tuning

Decision Tree Model

In [ ]:
model = DecisionTreeClassifier(random_state=1) 
model.fit(X_train,y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(model,X_train,y_train) 
In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(model,X_train,y_train) ## performance on train data
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
confusion_matrix_sklearn(model,X_test,y_test)
In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(model,X_test,y_test) ## performance for test data
decision_tree_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.664835 0.742801 0.752232 0.747487

Hyperparameter Tuning - Decision Tree

In [ ]:
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)

parameters = {
    "max_depth": np.arange(5, 16, 5),
    "min_samples_leaf": [3, 5, 7],
    "max_leaf_nodes": [2, 5],
    "min_impurity_decrease": [0.0001, 0.001],
}

scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1) 

grid_obj = grid_obj.fit(X_train, y_train)

dtree_estimator = grid_obj.best_estimator_

dtree_estimator.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', max_depth=5, max_leaf_nodes=2,
                       min_impurity_decrease=0.0001, min_samples_leaf=3,
                       random_state=1)
In [ ]:
confusion_matrix_sklearn(dtree_estimator,X_train,y_train)
In [ ]:
dtree_estimator_model_train_perf = model_performance_classification_sklearn(dtree_estimator,X_train,y_train) ## performance for train data on tuned estimator
dtree_estimator_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.712548 0.931923 0.720067 0.812411
In [ ]:
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)
In [ ]:
dtree_estimator_model_test_perf = model_performance_classification_sklearn(dtree_estimator,X_test,y_test) ## performance for test data on tuned estimator
dtree_estimator_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.706567 0.930852 0.715447 0.809058

Model is overfitting the data but after tuning, Recall is better than Accuracy.

Building bagging and boosting models

Bagging Classifier

In [ ]:
bagging_classifier = BaggingClassifier(random_state=1) 
bagging_classifier.fit(X_train,y_train)
Out[ ]:
BaggingClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(bagging_classifier,X_train,y_train) 
In [ ]:
bagging_classifier_model_train_perf = model_performance_classification_sklearn(bagging_classifier,X_train,y_train) ## performance on train data
bagging_classifier_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.985198 0.985982 0.99181 0.988887
In [ ]:
confusion_matrix_sklearn(bagging_classifier,X_test,y_test)
In [ ]:
bagging_classifier_model_test_perf = model_performance_classification_sklearn(bagging_classifier,X_test,y_test) ## performance for test data
bagging_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.691523 0.764153 0.771711 0.767913

Hyperparameter Tuning - Bagging Classifier

In [ ]:
bagging_estimator_tuned = BaggingClassifier(random_state=1)

parameters = {
    "max_samples": [0.7, 0.9],
    "max_features": [0.7, 0.9],
    "n_estimators": np.arange(90, 111, 10),
}

acc_scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer,cv=5) 
grid_obj = grid_obj.fit(X_train, y_train) 

bagging_estimator_tuned = grid_obj.best_estimator_

bagging_estimator_tuned.fit(X_train, y_train)
Out[ ]:
BaggingClassifier(max_features=0.7, max_samples=0.7, n_estimators=100,
                  random_state=1)
In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned,X_train,y_train)
In [ ]:
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(bagging_estimator_tuned,X_train,y_train) ## performance for train data on tuned estimator
bagging_estimator_tuned_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.996187 0.999916 0.994407 0.997154
In [ ]:
confusion_matrix_sklearn(bagging_estimator_tuned,X_test,y_test)
In [ ]:
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(bagging_estimator_tuned,X_test,y_test) ## performance for test data on tuned estimator
bagging_estimator_tuned_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.724228 0.895397 0.743857 0.812622

After tuning, bagging classifier is showing better Accuracy and Recall.

Random Forest

In [ ]:
rf_estimator = RandomForestClassifier(class_weight={0:0.18,1:0.72},random_state=1) 
rf_estimator.fit(X_train,y_train) 
Out[ ]:
RandomForestClassifier(class_weight={0: 0.18, 1: 0.72}, random_state=1)
In [ ]:
confusion_matrix_sklearn(rf_estimator,X_train,y_train)
In [ ]:
rf_estimator_model_train_perf = model_performance_classification_sklearn(rf_estimator,X_train,y_train) ## performance on train data
rf_estimator_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
confusion_matrix_sklearn(rf_estimator,X_test,y_test) 
In [ ]:
rf_estimator_model_test_perf = model_performance_classification_sklearn(rf_estimator,X_test,y_test) ## performance for test data
rf_estimator_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.718472 0.820176 0.772367 0.795554

Hyperparameter Tuning - Random Forest

In [ ]:
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)

parameters = {
    "max_depth": list(np.arange(5, 15, 5)),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [5, 7],
    "n_estimators": np.arange(15, 26, 5),
}

acc_scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5,n_jobs=-1) 
grid_obj = grid_obj.fit(X_train, y_train) 

rf_tuned = grid_obj.best_estimator_


rf_tuned.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier(max_depth=10, max_features='sqrt', min_samples_split=7,
                       n_estimators=20, oob_score=True, random_state=1)
In [ ]:
confusion_matrix_sklearn(rf_tuned,X_train,y_train)
In [ ]:
rf_tuned_model_train_perf = model_performance_classification_sklearn(rf_tuned,X_train,y_train) ## performance for train data on tuned estimator
rf_tuned_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.769119 0.91866 0.776556 0.841652
In [ ]:
confusion_matrix_sklearn(rf_tuned,X_test,y_test) 
In [ ]:
rf_tuned_model_test_perf = model_performance_classification_sklearn(rf_tuned,X_test,y_test) ## performance for test data on tuned estimator
rf_tuned_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.738095 0.898923 0.755391 0.82093

Random Forest is overfitting the data. With hyperparameter tuning, Recall has better values.

Boosting - Model Building and Hyperparameter Tuning

AdaBoost Classifier

In [ ]:
ab_classifier = AdaBoostClassifier(random_state=1) 
ab_classifier.fit(X_train,y_train)
Out[ ]:
AdaBoostClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(ab_classifier,X_train,y_train)
In [ ]:
ab_classifier_model_train_perf = model_performance_classification_sklearn(ab_classifier,X_train,y_train) ## performance on train data
ab_classifier_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.738226 0.887182 0.760688 0.81908
In [ ]:
confusion_matrix_sklearn(ab_classifier,X_test,y_test) 
In [ ]:
ab_classifier_model_test_perf = model_performance_classification_sklearn(ab_classifier,X_test,y_test) ## performance for test data
ab_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.734301 0.885015 0.757799 0.816481

Hyperparameter Tuning - AdaBoost Classifier

In [ ]:
abc_tuned = AdaBoostClassifier(random_state=1)

parameters = {
    
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=1),
        DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=1),
    ],
    "n_estimators": np.arange(80, 101, 10),
    "learning_rate": np.arange(0.1, 0.4, 0.1),
}

acc_scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5) 
grid_obj = grid_obj.fit(X_train,y_train) 

abc_tuned = grid_obj.best_estimator_

abc_tuned.fit(X_train, y_train)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                         max_depth=1,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
In [ ]:
confusion_matrix_sklearn(abc_tuned,X_train,y_train)
In [ ]:
abc_tuned_model_train_perf = model_performance_classification_sklearn(abc_tuned,X_train,y_train) ## performance for train data on tuned estimator
abc_tuned_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.718995 0.781247 0.794587 0.787861
In [ ]:
confusion_matrix_sklearn(abc_tuned,X_test,y_test)
In [ ]:
abc_tuned_model_test_perf = model_performance_classification_sklearn(abc_tuned,X_test,y_test) ## performance for test data on tuned estimator
abc_tuned_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.71651 0.781391 0.791468 0.786397

Tuning Adaboost hyperparameters does not seem to help much. Accuracy and Recall have good values.

Gradient Boosting Classifier

In [ ]:
gb_classifier = GradientBoostingClassifier(random_state=1) 
gb_classifier.fit(X_train,y_train)
Out[ ]:
GradientBoostingClassifier(random_state=1)
In [ ]:
confusion_matrix_sklearn(gb_classifier,X_train,y_train)
In [ ]:
gb_classifier_model_train_perf = model_performance_classification_sklearn(gb_classifier,X_train,y_train) ## performance on train data
gb_classifier_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.758802 0.88374 0.783042 0.830349
In [ ]:
confusion_matrix_sklearn(gb_classifier,X_test,y_test)
In [ ]:
gb_classifier_model_test_perf = model_performance_classification_sklearn(gb_classifier,X_test,y_test) ## performance for test data
gb_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.744767 0.876004 0.772366 0.820927

Hyperparameter Tuning - Gradient Boosting Classifier

In [ ]:
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)

parameters = {
    "n_estimators": [200, 250],
    "subsample": [0.9, 1],
    "max_features": [0.8, 0.9],
    "learning_rate": np.arange(0.1, 0.21, 0.1),
}

acc_scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5) 
grid_obj = grid_obj.fit(X_train,y_train) 

gbc_tuned = grid_obj.best_estimator_

gbc_tuned.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.8, n_estimators=200, random_state=1,
                           subsample=1)
In [ ]:
confusion_matrix_sklearn(gbc_tuned,X_train,y_train)
In [ ]:
gbc_tuned_model_train_perf = model_performance_classification_sklearn(gbc_tuned,X_train,y_train) ## performance for train data on tuned estimator
gbc_tuned_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.764017 0.882649 0.789059 0.833234
In [ ]:
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)
In [ ]:
gbc_tuned_model_test_perf = model_performance_classification_sklearn(gbc_tuned,X_test,y_test) ## performance for test data on tuned estimator
gbc_tuned_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.743459 0.871303 0.773296 0.819379

Gardient boosting classifier seems to be a good model.

XGBoost Classifier

In [ ]:
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss') 
xgb_classifier.fit(X_train,y_train)
Out[ ]:
XGBClassifier(eval_metric='logloss', random_state=1)
In [ ]:
confusion_matrix_sklearn(xgb_classifier,X_train,y_train)
In [ ]:
xgb_classifier_model_train_perf = model_performance_classification_sklearn(xgb_classifier,X_train,y_train) ## performance on train data
xgb_classifier_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.756279 0.883573 0.780513 0.828852
In [ ]:
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)
In [ ]:
xgb_classifier_model_test_perf = model_performance_classification_sklearn(xgb_classifier,X_test,y_test) ## performance for test data
xgb_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.744636 0.877375 0.771576 0.821082

Hyperparameter Tuning - XGBoost Classifier

In [ ]:
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

parameters = {
    "n_estimators": np.arange(150, 250, 50),
    "scale_pos_weight": [1, 2],
    "subsample": [0.9, 1],
    "learning_rate": np.arange(0.1, 0.21, 0.1),
    "gamma": [3, 5],
    "colsample_bytree": [0.8, 0.9],
    "colsample_bylevel": [ 0.9, 1],
}

acc_scorer = metrics.make_scorer(metrics.f1_score)

grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5) 
grid_obj = grid_obj.fit(X_train,y_train) 

xgb_tuned = grid_obj.best_estimator_

xgb_tuned.fit(X_train, y_train)
Out[ ]:
XGBClassifier(colsample_bylevel=0.9, colsample_bytree=0.9,
              eval_metric='logloss', gamma=5, n_estimators=200, random_state=1)
In [ ]:
confusion_matrix_sklearn(xgb_tuned,X_train,y_train)
In [ ]:
xgb_tuned_model_train_perf = model_performance_classification_sklearn(xgb_tuned,X_train,y_train) ## performance for train data on tuned estimator
xgb_tuned_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.756504 0.883069 0.780995 0.828901
In [ ]:
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)
In [ ]:
xgb_tuned_model_test_perf = model_performance_classification_sklearn(xgb_tuned,X_test,y_test) ## performance for test data on tuned estimator
xgb_tuned_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.744244 0.8762 0.771739 0.820659

Stacking Classifier

In [ ]:
estimators = [
    ("AdaBoost", ab_classifier),
    ("Gradient Boosting", gbc_tuned),
    ("Random Forest", rf_tuned),
]

final_estimator = xgb_tuned

stacking_classifier = StackingClassifier(estimators=estimators,final_estimator=final_estimator) 

stacking_classifier.fit(X_train,y_train) 
Out[ ]:
StackingClassifier(estimators=[('AdaBoost', AdaBoostClassifier(random_state=1)),
                               ('Gradient Boosting',
                                GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                                           max_features=0.8,
                                                           n_estimators=200,
                                                           random_state=1,
                                                           subsample=1)),
                               ('Random Forest',
                                RandomForestClassifier(max_depth=10,
                                                       max_features='sqrt',
                                                       min_samples_split=7,
                                                       n_estimators=20,
                                                       oob_score=True,
                                                       random_state=1))],
                   final_estimator=XGBClassifier(colsample_bylevel=0.9,
                                                 colsample_bytree=0.9,
                                                 eval_metric='logloss', gamma=5,
                                                 n_estimators=200,
                                                 random_state=1))
In [ ]:
confusion_matrix_sklearn(stacking_classifier,X_train,y_train) 
In [ ]:
stacking_classifier_model_train_perf = model_performance_classification_sklearn(stacking_classifier,X_train,y_train) ## performance on train data
stacking_classifier_model_train_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.767549 0.891211 0.788372 0.836643
In [ ]:
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
In [ ]:
stacking_classifier_model_test_perf = model_performance_classification_sklearn(stacking_classifier,X_test,y_test) ## performance for test data
stacking_classifier_model_test_perf
Out[ ]:
Accuracy Recall Precision F1
0 0.744898 0.878355 0.771375 0.821396

Stacking classifier also seems to be a good model.

Model Performance Comparison and Conclusions

In [ ]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        dtree_estimator_model_train_perf.T,
        bagging_classifier_model_train_perf.T,
        bagging_estimator_tuned_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        ab_classifier_model_train_perf.T,
        abc_tuned_model_train_perf.T,
        gb_classifier_model_train_perf.T,
        gbc_tuned_model_train_perf.T,
        xgb_classifier_model_train_perf.T,
        xgb_tuned_model_train_perf.T,
        stacking_classifier_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree Tuned Decision Tree Bagging Classifier Tuned Bagging Classifier Random Forest Tuned Random Forest Adaboost Classifier Tuned Adaboost Classifier Gradient Boost Classifier Tuned Gradient Boost Classifier XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 1.0 0.712548 0.985198 0.996187 1.0 0.769119 0.738226 0.718995 0.758802 0.764017 0.756279 0.756504 0.767549
Recall 1.0 0.931923 0.985982 0.999916 1.0 0.918660 0.887182 0.781247 0.883740 0.882649 0.883573 0.883069 0.891211
Precision 1.0 0.720067 0.991810 0.994407 1.0 0.776556 0.760688 0.794587 0.783042 0.789059 0.780513 0.780995 0.788372
F1 1.0 0.812411 0.988887 0.997154 1.0 0.841652 0.819080 0.787861 0.830349 0.833234 0.828852 0.828901 0.836643
In [ ]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        dtree_estimator_model_test_perf.T,
        bagging_classifier_model_test_perf.T,
        bagging_estimator_tuned_model_test_perf.T,
        rf_estimator_model_test_perf.T,
        rf_tuned_model_test_perf.T,
        ab_classifier_model_test_perf.T,
        abc_tuned_model_test_perf.T,
        gb_classifier_model_test_perf.T,
        gbc_tuned_model_test_perf.T,
        xgb_classifier_model_test_perf.T,
        xgb_tuned_model_test_perf.T,
        stacking_classifier_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[ ]:
Decision Tree Tuned Decision Tree Bagging Classifier Tuned Bagging Classifier Random Forest Tuned Random Forest Adaboost Classifier Tuned Adaboost Classifier Gradient Boost Classifier Tuned Gradient Boost Classifier XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 0.664835 0.706567 0.691523 0.724228 0.718472 0.738095 0.734301 0.716510 0.744767 0.743459 0.744636 0.744244 0.744898
Recall 0.742801 0.930852 0.764153 0.895397 0.820176 0.898923 0.885015 0.781391 0.876004 0.871303 0.877375 0.876200 0.878355
Precision 0.752232 0.715447 0.771711 0.743857 0.772367 0.755391 0.757799 0.791468 0.772366 0.773296 0.771576 0.771739 0.771375
F1 0.747487 0.809058 0.767913 0.812622 0.795554 0.820930 0.816481 0.786397 0.820927 0.819379 0.821082 0.820659 0.821396

Tuned Adaboost classifier and Gradient Boost classifier seem to good models for the data provided because Accuracy< Recall, precision and F1 percentages are higher.

Important features of the final model

In [ ]:
feature_names = X_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Actionable Insights and Recommendations

Using the Tuned Adaboost classifier model or Gradient Boost classifier model, data can be more accurately predicted for Certification or Denial.

The best sutiable factors to be considered for applicants who visa should be ceritified is - Education (preferably Doctorate degree) Have Experience in Job Higher Prevailing wages Unit of wage paid (preferably Yearly)

The best sutiable factors to be considered for applicants who visa should be Denied is - Education (Lowest degree possible) if there is no Experience in Job lesser Prevailing wages Unit of wage paid (hourly)