Car Evaluation

This dataset was downloaded from the UC Irvine Machine Learning Repository.

The dataset concerns the evaluation of cars.
The target variable/label is car acceptability and has four categories : unacceptable, acceptable, good and very good.

The input attributes fall under two broad categories - Price and Technical Characteristics.

  • Under Price, the attributes are buying price and maintenance price.
  • Under Technical characteristics, the attributes are doors, persons, size of luggage boot and safety.

We have identified that this is an imbalanced dataset with skewed class (output category/label) proportions.

The objective here is to build a multiclass classifier based on the input attributes.

Summary of Key information

Number of Instances/training examples           : 1728  
Number of Instances with missing attributes     :    0  
Number of qualified Instances/training examples : 1728

Number of Input Attributes                     :  6
Number of categorical attributes               :  6
Number of numerical attributes                 :  0

Target Attribute Type                          : Multi class label
Target Class distribution                      : 70.0% : 22.2% : 4.0% : 3.8%
Problem Identification                         : Multiclass Classification with imbalanced data set

Importing the necessary libraries

In [350]:
# Data Wrangling, inspection 
import numpy as np
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt

# Data preprocessing 
import category_encoders as ce 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OrdinalEncoder

# sklearn ml models
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn.metrics import recall_score, precision_score, \
accuracy_score, plot_confusion_matrix, classification_report, f1_score
# Note: plot_confusion_matrix is deprecated since sklearn 1.0
# (see the ConfusionMatrixDisplay sketch in the Logistic Regression section)

Loading the data set

In [351]:
pathname = "/Users/bhaskarroy/BHASKAR FILES/BHASKAR CAREER/Career/Skills/Data Science/Practise/Python/UCI Machine Learning Repository/car"
path0 = "/car.c45-names"
path1 = "/car.data"
path2 = "/car.names"


pathdata = pathname + path1
pathcolname = pathname + path0
pathdatadesc = pathname + path2
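
The same paths can be built with pathlib, which composes paths with the / operator (an equivalent alternative sketch; open() and pd.read_csv accept Path objects):

from pathlib import Path

base = Path(pathname)                  # the directory defined above
pathdata = base / "car.data"           # raw instances
pathcolname = base / "car.c45-names"   # C4.5 names file
pathdatadesc = base / "car.names"      # dataset description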

Dataset Information

In [352]:
with open(pathdatadesc) as f:
    print(f.read())
1. Title: Car Evaluation Database

2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec   (marko.bohanec@ijs.si)
               Blaz Zupan      (blaz.zupan@ijs.si)
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

   Input attributes are printed in lowercase. Besides the target
   concept (CAR), the model includes three intermediate concepts:
   PRICE, TECH, COMFORT. Every concept is in the original model
   related to its lower level descendants by a set of examples (for
   these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).

   The Car Evaluation Database contains examples with the structural
   information removed, i.e., directly relates CAR to the six input
   attributes: buying, maint, doors, persons, lug_boot, safety.

   Because of known underlying concept structure, this database may be
   particularly useful for testing constructive induction and
   structure discovery methods.

5. Number of Instances: 1728
   (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %) 

Attribute Information

In [353]:
with open(pathcolname) as f:
    print(f.read())
| names file (C4.5 format) for car evaluation domain

| class values

unacc, acc, good, vgood

| attributes

buying:   vhigh, high, med, low.
maint:    vhigh, high, med, low.
doors:    2, 3, 4, 5more.
persons:  2, 4, more.
lug_boot: small, med, big.
safety:   low, med, high.

Data Preprocessing

We will prepare the data for :

  • Exploratory Data analysis (EDA) and
  • for model building

The following actions were undertaken :

  • Converting to Dataframe Format
  • Inspecting whether any missing values are present
  • Handling Missing values : there are no missing values. Hence, the entire dataset can be considered for model building.
  • Processing Categorical Attributes : categorical attributes have been converted to the categorical data type for EDA.
  • Processing Continuous Attributes : not applicable, as both the input and output attributes are categorical.

Converting to Dataframe Format

In [354]:
colnames = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data = pd.read_csv(pathdata, names = colnames, index_col = False)
data
Out[354]:
buying maint doors persons lug_boot safety class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
... ... ... ... ... ... ... ...
1723 low low 5more more med med good
1724 low low 5more more med high vgood
1725 low low 5more more big low unacc
1726 low low 5more more big med good
1727 low low 5more more big high vgood

1728 rows × 7 columns

In [355]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
In [356]:
data.describe()
Out[356]:
buying maint doors persons lug_boot safety class
count 1728 1728 1728 1728 1728 1728 1728
unique 4 4 4 3 3 3 4
top low low 5more 4 big low unacc
freq 432 432 432 576 576 576 1210
In [357]:
data.dtypes
Out[357]:
buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
class       object
dtype: object
In [358]:
# Inspect if any missing values present
data.isnull().sum()
Out[358]:
buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

Converting columns to categorical data types

In [359]:
for i in data.columns :
    print(f'{i} : {data[i].unique().tolist()}')
buying : ['vhigh', 'high', 'med', 'low']
maint : ['vhigh', 'high', 'med', 'low']
doors : ['2', '3', '4', '5more']
persons : ['2', '4', 'more']
lug_boot : ['small', 'med', 'big']
safety : ['low', 'med', 'high']
class : ['unacc', 'acc', 'vgood', 'good']
In [360]:
cat_vars = data.columns
In [361]:
catdict = {
    "buying":   ['low', 'med', 'high', 'vhigh'],
    "maint":    ['low', 'med', 'high', 'vhigh'],
    "doors":    ['2', '3', '4', '5more'],
    "persons":  ['2', '4', 'more'],
    "lug_boot": ['small', 'med', 'big'],
    "safety":   ['low', 'med', 'high'],
    "class":    ['unacc', 'acc', 'good', 'vgood']  # the data file uses 'vgood', not 'v-good'
}
In [362]:
for i in cat_vars :
    data[i] = pd.Categorical(data[i], 
                             categories=catdict[i], ordered=True)
In [363]:
data.dtypes
Out[363]:
buying      category
maint       category
doors       category
persons     category
lug_boot    category
safety      category
class       category
dtype: object
In [364]:
def show(data):
  for i in data.columns:
    print("Feature: {} with {} Levels".format(i, data[i].unique()))

show(data)
Feature: buying with ['vhigh', 'high', 'med', 'low']
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh'] Levels
Feature: maint with ['vhigh', 'high', 'med', 'low']
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh'] Levels
Feature: doors with ['2', '3', '4', '5more']
Categories (4, object): ['2' < '3' < '4' < '5more'] Levels
Feature: persons with ['2', '4', 'more']
Categories (3, object): ['2' < '4' < 'more'] Levels
Feature: lug_boot with ['small', 'med', 'big']
Categories (3, object): ['small' < 'med' < 'big'] Levels
Feature: safety with ['low', 'med', 'high']
Categories (3, object): ['low' < 'med' < 'high'] Levels
Feature: class with ['unacc', 'acc', 'vgood', 'good']
Categories (4, object): ['unacc' < 'acc' < 'good' < 'vgood'] Levels
In [365]:
data.isnull().sum()
Out[365]:
buying       0
maint        0
doors        0
persons      0
lug_boot     0
safety       0
class        0
dtype: int64

Univariate Analysis : Categorical Variables

In [366]:
#Accessing colors from external library Palettable
from palettable.cartocolors.qualitative import Bold_10 
colors = Bold_10.mpl_colors

#colors = plt.cm.Dark2(range(15))
#colors = plt.cm.tab20(range(15))

#.colors attribute for listed colormaps
#colors = plt.cm.tab10.colors 
#colors = plt.cm.Paired.colors
In [367]:
# custom function for easy and efficient analysis of categorical univariate
def UVA_category(data_frame, var_group = [], **kargs):

  '''
Stands for Univariate_Analysis_categorical.
  Takes a group of categorical variables and, for each, prints the value_counts and plots a horizontal barplot.

- data_frame : The Dataframe
  - var_group : The list of column names for which univariate plots are to be plotted

  The keyword arguments are as follows :
- colcount : The number of columns in the plot layout. Default value is 2.
  For instance, if there are 4 columns in var_group, the 4 univariate plots will be laid out in a 2x2 grid.
  - colwidth : width of each plot
  - rowheight : height of each plot
  - normalize : Whether to present absolute values or percentage
  - sort_by : Whether to sort the bars by descending order of values

- axlabel_fntsize : fontsize of the x-axis and y-axis labels
  - axticklabel_fntsize : fontsize of the axis tick labels
  - infofntsize : fontsize of the info of unique value counts
  - infofntfamily : Font family of info of unique value counts.
  Choose font family belonging to Monospace for multiline alignment.
  Some choices are : 'Consolas', 'Courier','Courier New', 'Lucida Sans Typewriter','Lucidatypewriter','Andale Mono'
  https://www.tutorialbrain.com/css_tutorial/css_font_family_list/
  - max_val_counts : Number of unique values for which count should be displayed
  - nspaces : Length of each line for the multiline strings in the info area for value_counts
  - ncountspaces : Length allocated to the count value for the unique values in the info area
  - show_percentage : Whether to show percentage of total for each unique value count
  Also check link for formatting syntax : https://pyformat.info/#number
  '''

  import textwrap
  data = data_frame.copy(deep = True)
  # Using dictionaries with default values of keyword arguments
  params_plot = dict(colcount = 2, colwidth = 7, rowheight = 4, normalize = False, sort_by = "Values")
  params_fontsize =  dict(axlabel_fntsize = 10,axticklabel_fntsize = 8, infofntsize = 10)
  params_fontfamily = dict(infofntfamily = 'Andale Mono')
  params_max_val_counts = dict(max_val_counts = 10)
  params_infospaces = dict(nspaces = 10, ncountspaces = 4)
  params_show_percentage = dict(show_percentage = True)



  # Updating the dictionary with parameter values passed while calling the function
  params_plot.update((k, v) for k, v in kargs.items() if k in params_plot)
  params_fontsize.update((k, v) for k, v in kargs.items() if k in params_fontsize)
  params_fontfamily.update((k, v) for k, v in kargs.items() if k in params_fontfamily)
  params_max_val_counts.update((k, v) for k, v in kargs.items() if k in params_max_val_counts)
  params_infospaces.update((k, v) for k, v in kargs.items() if k in params_infospaces)
  params_show_percentage.update((k, v) for k, v in kargs.items() if k in params_show_percentage)

  #params = dict(**params_plot, **params_fontsize)

  # Initialising all the possible keyword arguments of doc string with updated values
  colcount = params_plot['colcount']
  colwidth = params_plot['colwidth']
  rowheight = params_plot['rowheight']
  normalize = params_plot['normalize']
  sort_by = params_plot['sort_by']

  axlabel_fntsize = params_fontsize['axlabel_fntsize']
  axticklabel_fntsize = params_fontsize['axticklabel_fntsize']
  infofntsize = params_fontsize['infofntsize']
  infofntfamily = params_fontfamily['infofntfamily']
  max_val_counts =  params_max_val_counts['max_val_counts']
  nspaces = params_infospaces['nspaces']
  ncountspaces = params_infospaces['ncountspaces']
  show_percentage = params_show_percentage['show_percentage']

  if len(var_group) == 0:
        var_group = data.select_dtypes(exclude = ['number']).columns.to_list()

  import matplotlib.pyplot as plt
  plt.rcdefaults()
  # setting figure_size
  size = len(var_group)
  #rowcount = 1
  #colcount = size//rowcount+(size%rowcount != 0)*1


  rowcount = size//colcount + (size%colcount != 0)*1

  plt.figure(figsize = (colwidth*colcount,rowheight*rowcount), dpi = 150)


  # Converting the filtered columns as categorical
  for i in var_group:
        #data[i] = data[i].astype('category')
        data[i] = pd.Categorical(data[i])


  # for every variable
  for j,i in enumerate(var_group):
    #print('{} : {}'.format(j,i))
    norm_count = data[i].value_counts(normalize = normalize).sort_index()
    n_uni = data[i].nunique()

    if sort_by == "Values":
        norm_count = data[i].value_counts(normalize = normalize).sort_values(ascending = False)
        n_uni = data[i].nunique()


  #Plotting the variable with every information
    plt.subplot(rowcount,colcount,j+1)
    sns.barplot(x = norm_count, y = norm_count.index , order = norm_count.index)

    if normalize == False :
        plt.xlabel('count', fontsize = axlabel_fntsize )
    else :
        plt.xlabel('fraction/percent', fontsize = axlabel_fntsize )
    plt.ylabel('{}'.format(i), fontsize = axlabel_fntsize )

    ax = plt.gca()

    # textwrapping
    ax.set_yticklabels([textwrap.fill(str(e), 20) for e in norm_count.index], fontsize = axticklabel_fntsize)

    #print(n_uni)
    #print(type(norm_count.round(2)))

    # Functions to convert the pairing of unique values and value_counts into text string
    # Function to break a word into multiline string of fixed width per line
    def paddingString(word, nspaces = 20):
        i = len(word)//nspaces \
            +(len(word)%nspaces > 0)*(len(word)//nspaces > 0)*1 \
            + (len(word)//nspaces == 0)*1
        strA = ""
        for j in range(i-1):
            strA = strA+'\n'*(len(strA)>0)+ word[j*nspaces:(j+1)*nspaces]

        # insert appropriate number of white spaces
        strA = strA + '\n'*(len(strA)>0)*(i>1)+word[(i-1)*nspaces:] \
               + " "*(nspaces-len(word)%nspaces)*(len(word)%nspaces > 0)
        return strA

    # Function to convert Pandas series into multi line strings
    def create_string_for_plot(ser, nspaces = nspaces, ncountspaces = ncountspaces, \
                              show_percentage =  show_percentage):
        '''
        - nspaces : Length of each line for the multiline strings in the info area for value_counts
        - ncountspaces : Length allocated to the count value for the unique values in the info area
        - show_percentage : Whether to show percentage of total for each unique value count
        Also check link for formatting syntax : https://pyformat.info/#number
        '''
        str_text = ""
        for index, value in ser.items():
            str_tmp = paddingString(str(index), nspaces)+ " : " \
                      + " "*(ncountspaces-len(str(value)))*(len(str(value))<= ncountspaces) \
                      + str(value) \
                      + (" | " + "{:4.1f}%".format(value/ser.sum()*100))*show_percentage


            str_text = str_text + '\n'*(len(str_text)>0) + str_tmp
        return str_text

    #print(create_string_for_plot(norm_count.round(2)))

    #Ensuring a maximum of 10 unique values displayed
    if norm_count.round(2).size <= max_val_counts:
        text = '{}\nn_uniques = {}\nvalue counts\n{}' \
                .format(i, n_uni,create_string_for_plot(norm_count.round(2)))
        ax.annotate(text = text,
                    xy = (1.1, 1), xycoords = ax.transAxes,
                    ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)
    else :
        text = '{}\nn_uniques = {}\nvalue counts of top 10\n{}' \
                .format(i, n_uni,create_string_for_plot(norm_count.round(2)[0:max_val_counts]))
        ax.annotate(text = text,
                    xy = (1.1, 1), xycoords = ax.transAxes,
                    ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)


    plt.gcf().tight_layout()
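
A quick usage sketch of the function defined above (the overview cells below call a packaged variant from the local eda module instead):

# Normalized value counts for two columns, stacked in one column of plots
UVA_category(data, ['class', 'safety'], colcount = 1, normalize = True)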

Overview of all the categorical variables

In [368]:
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
                          rowheight = 3, normalize = False);
Categorical features : Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object') 

In [369]:
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
                          rowheight = 3, normalize = True);
Categorical features : Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object') 

In [370]:
from eda import composite_plots
composite_plots.bar_counts(data, data.columns)

For a car to be acceptable, it has to be low on at least one of the pricing parameters - buying price or maintenance.

In [371]:
dv = "buying"
df = pd.crosstab([data[dv]],[data['class']])
df.head()
df.plot(kind='bar', stacked = True, title=dv )
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.ylabel('Class Frequency')
Out[371]:
Text(0, 0.5, 'Class Frequency')
In [372]:
data.columns
Out[372]:
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')
In [373]:
from eda.composite_plots import features_plots
features_plots(data, ['buying', 'maint'], 'class')
In [374]:
pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False).transpose()
Out[374]:
buying low med high vhigh
maint low med high vhigh low med high vhigh low med high vhigh low med high vhigh
class
unacc 62 62 62 72 62 62 72 72 72 72 72 108 72 72 108 108
acc 10 10 33 36 10 33 36 36 36 36 36 0 36 36 0 0
good 23 23 0 0 23 0 0 0 0 0 0 0 0 0 0 0
vgood 13 13 13 0 13 13 0 0 0 0 0 0 0 0 0 0
In [375]:
data.pivot_table(index=['buying','maint'], aggfunc='size')
Out[375]:
buying  maint
low     low      108
        med      108
        high     108
        vhigh    108
med     low      108
        med      108
        high     108
        vhigh    108
high    low      108
        med      108
        high     108
        vhigh    108
vhigh   low      108
        med      108
        high     108
        vhigh    108
dtype: int64
In [376]:
df3 = data.groupby(["buying", "maint"]).size().reset_index(name="value_count")
sns.barplot(x = 'buying', y = 'value_count', hue = 'maint', data = df3)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[376]:
<matplotlib.legend.Legend at 0x7fac94fdab50>

CAR : car acceptability

  • PRICE : overall price

    • buying : buying price
    • maint : price of the maintenance

  • TECH : technical characteristics

    • COMFORT : comfort

      • doors : number of doors
      • persons : capacity in terms of persons to carry
      • lug_boot : the size of luggage boot

    • safety : estimated safety of the car

In [377]:
# Random inspection - price vs class
pd.set_option('display.max_rows', 20)
df4 = pd.DataFrame({'count' : data.groupby(by = ['buying','maint','class']).size()}).reset_index()
df4.style.background_gradient(cmap='Blues')
#https://stackoverflow.com/questions/12286607/making-heatmap-from-pandas-dataframe
Out[377]:
buying maint class count
0 low low unacc 62
1 low low acc 10
2 low low good 23
3 low low vgood 13
4 low med unacc 62
5 low med acc 10
6 low med good 23
7 low med vgood 13
8 low high unacc 62
9 low high acc 33
10 low high good 0
11 low high vgood 13
12 low vhigh unacc 72
13 low vhigh acc 36
14 low vhigh good 0
15 low vhigh vgood 0
16 med low unacc 62
17 med low acc 10
18 med low good 23
19 med low vgood 13
20 med med unacc 62
21 med med acc 33
22 med med good 0
23 med med vgood 13
24 med high unacc 72
25 med high acc 36
26 med high good 0
27 med high vgood 0
28 med vhigh unacc 72
29 med vhigh acc 36
30 med vhigh good 0
31 med vhigh vgood 0
32 high low unacc 72
33 high low acc 36
34 high low good 0
35 high low vgood 0
36 high med unacc 72
37 high med acc 36
38 high med good 0
39 high med vgood 0
40 high high unacc 72
41 high high acc 36
42 high high good 0
43 high high vgood 0
44 high vhigh unacc 108
45 high vhigh acc 0
46 high vhigh good 0
47 high vhigh vgood 0
48 vhigh low unacc 72
49 vhigh low acc 36
50 vhigh low good 0
51 vhigh low vgood 0
52 vhigh med unacc 72
53 vhigh med acc 36
54 vhigh med good 0
55 vhigh med vgood 0
56 vhigh high unacc 108
57 vhigh high acc 0
58 vhigh high good 0
59 vhigh high vgood 0
60 vhigh vhigh unacc 108
61 vhigh vhigh acc 0
62 vhigh vhigh good 0
63 vhigh vhigh vgood 0
In [378]:
df4
Out[378]:
buying maint class count
0 low low unacc 62
1 low low acc 10
2 low low good 23
3 low low vgood 13
4 low med unacc 62
... ... ... ... ...
59 vhigh high vgood 0
60 vhigh vhigh unacc 108
61 vhigh vhigh acc 0
62 vhigh vhigh good 0
63 vhigh vhigh vgood 0

64 rows × 4 columns

In [379]:
#pd.DataFrame(data[['buying','maint','class']].value_counts())

Price vs Class heatmap

Price has two components - buying price and maintenance.

  • Data is equally distributed across the combinations of buying price (4 levels) and maintenance (4 levels).
  • Class instances of good and very good are concentrated where buying price and maintenance are low or medium.

Buying Price vs Class

In [380]:
pd.crosstab(index = [data['buying']], columns = data['class'], margins = False)
Out[380]:
class unacc acc good vgood
buying
low 258 89 46 39
med 268 115 23 26
high 324 108 0 0
vhigh 360 72 0 0
In [381]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, index = ['buying'], column = ['class'])

Maintenance vs Class

In [382]:
pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)
Out[382]:
class unacc acc good vgood
maint
low 268 92 46 26
med 268 115 23 26
high 314 105 0 13
vhigh 360 72 0 0
In [383]:
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)

df_MainVsClass = pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)

sns.heatmap(df_MainVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12) 
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
Out[383]:
Text(20.72222222222222, 0.5, 'maint')

Buying vs Maintenance

In [384]:
df_BuyingVsMaintenance = pd.crosstab(index = [data['buying']], columns = data['maint'], margins = False)

df_BuyingVsMaintenance
Out[384]:
maint low med high vhigh
buying
low 108 108 108 108
med 108 108 108 108
high 108 108 108 108
vhigh 108 108 108 108
In [385]:
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)
                       
sns.heatmap(df_BuyingVsMaintenance)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12) 
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
Out[385]:
Text(20.72222222222222, 0.5, 'buying')

Price (Buying Price and Maintenance) vs Class

In [386]:
df_PricingVsClass = pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False)

df_PricingVsClass
Out[386]:
class unacc acc good vgood
buying maint
low low 62 10 23 13
med 62 10 23 13
high 62 33 0 13
vhigh 72 36 0 0
med low 62 10 23 13
med 62 33 0 13
high 72 36 0 0
vhigh 72 36 0 0
high low 72 36 0 0
med 72 36 0 0
high 72 36 0 0
vhigh 108 0 0 0
vhigh low 72 36 0 0
med 72 36 0 0
high 108 0 0 0
vhigh 108 0 0 0
In [387]:
fig, ax = plt.subplots(figsize=(5,4))
#font_kwds = dict(fontsize = 12)
sns.heatmap(df_PricingVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 10) 
plt.xlabel(ax.get_xlabel(), fontsize = 8, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 8, fontweight = 'heavy')
Out[387]:
Text(33.222222222222214, 0.5, 'buying-maint')

Comfort vs Class Heatmaps

Comfort has three aspects - doors, luggage boot size and persons.

  • All two-person cars are classified as unacceptable.
  • No other specific pattern emerges from the EDA of Comfort vs Class.
In [388]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['lug_boot'], ['class'])
In [389]:
pd.crosstab([data['doors'], data['lug_boot']], [data['class']])
Out[389]:
class unacc acc good vgood
doors lug_boot
2 small 126 15 3 0
med 108 30 6 0
big 92 36 6 10
3 small 108 30 6 0
med 100 33 6 5
big 92 36 6 10
4 small 108 30 6 0
med 92 36 6 10
big 92 36 6 10
5more small 108 30 6 0
med 92 36 6 10
big 92 36 6 10
In [390]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['doors','lug_boot'], ['class'])
In [391]:
data.columns
Out[391]:
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

Defining the Problem

This is a multiclass classification problem.
The data is imbalanced, as the classes are not equally represented.
Accuracy is not the metric to use when working with an imbalanced dataset: it is misleading, because the majority class dominates the score (a small illustrative sketch follows the note below).
Instead, we can use the performance measures below :

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
  • Precision: A measure of a classifier's exactness.
  • Recall: A measure of a classifier's completeness.
  • F1 Score (or F-score): A weighted average of precision and recall.

Appropriateness of the performance measures :

  • Accuracy : appropriate for balanced datasets.
  • Precision : appropriate when minimising false positives is the focus.
  • Recall : appropriate when minimising false negatives is the focus.
  • F-measure : combines precision and recall into a single measure that captures both properties.

Note : It is common practice to measure the efficacy of binary classification models using the Area Under the Curve (AUC) of the ROC curve. Multiclass classification requires some tweaks to the ROC AUC approach; we are not pursuing that in this notebook. Read this for more info : https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

http://www.svds.com/learning-imbalanced-classes/
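
To make the accuracy pitfall concrete, here is a small illustrative sketch on synthetic labels (not the car data): a degenerate classifier that always predicts the majority class looks strong on accuracy, while macro recall and F1 expose its failure on the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic 90:10 imbalance; the "model" predicts the majority class everywhere
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                 # 0.90 -- looks good
print(recall_score(y_true, y_pred, average='macro'))  # 0.50 -- minority recall is 0
print(f1_score(y_true, y_pred, average='macro'))      # ~0.47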

In [392]:
from eda.axes_utils import Add_valuecountsinfo, Add_data_labels

ax = data['class'].value_counts().plot(kind = 'bar', figsize=(3,3))
ax.yaxis.grid(True, alpha = 0.3)
ax.set(axisbelow = True)
Add_data_labels(ax.patches)
Add_valuecountsinfo(ax, 'class', data)
In [393]:
from eda import axes_utils
print(list(dir(axes_utils)))
['Add_data_labels', 'Add_valuecountsinfo', 'Change_barWidth', 'Highlight_Top_n_values', 'Set_axes_labels_titles', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'blended_transform_factory', 'create_string_for_plot', 'get_hist', 'gridspec', 'mdates', 'np', 'paddingString', 'pd', 'plt', 'sns', 'stylize_axes', 'time']

Approach for imbalanced datasets

Techniques to correctly distinguish the minority class can be categorized into four main groups, depending on how they deal with the problem. The four groups are :


  1. Algorithm level approaches (also called internal) try to adapt existing classifier learning algorithms to bias the learning toward the minority class. To perform the adaptation, special knowledge of both the corresponding classifier and the application domain is required, so as to comprehend why the classifier fails when the class distribution is uneven. More details about these types of methods are given in Chap. 6.

  2. Data level (or external) approaches aim at rebalancing the class distribution by resampling the data space. This way, modification of the learning algorithm is avoided, since the effect caused by imbalance is decreased with a preprocessing step. These methods are discussed in depth in Chap. 5.

  3. Cost-sensitive learning frameworks fall between data and algorithm level approaches. Both data level transformations (by adding costs to instances) and algorithm level modifications (by modifying the learning process to accept costs) [13, 48, 86] are incorporated. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost errors of both classes. An overview of cost-sensitive approaches for the class imbalance problem is presented in Chap. 4 (see the sketch after this list).

  4. Ensemble-based methods usually consist of a combination of an ensemble learning algorithm [59] and one of the techniques above, specifically data level and cost-sensitive ones [27]. When a data level approach is added to the ensemble learning algorithm, the new hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. Ensemble-based models are thoroughly described in Chap. 7.

— Learning from Imbalanced Data Sets, Alberto Fernández et al.
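
Of the four groups, sklearn makes the cost-sensitive route (group 3) the cheapest to try: many estimators accept a class_weight parameter that raises the misclassification cost of rare classes. A minimal sketch (a hypothetical weighted variant, not fitted here; the Linear SVC section below uses the same idea):

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' sets each class weight to
# n_samples / (n_classes * class_count), so errors on the rare
# 'good'/'vgood' classes are penalised more heavily during fitting
lr_weighted = LogisticRegression(max_iter=1000, class_weight='balanced')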

Applying Machine Learning Algorithms for classification

Training and testing the data

In [394]:
X_cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
Y_cols = ['class']

X = data[X_cols]
Y = data[Y_cols]
In [395]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)
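
Note that the split above is not stratified; given the roughly 70:22:4:4 class distribution, a stratified split keeps those proportions identical in both partitions. A sketch of the alternative (not applied here; the results below use the unstratified split):

# Stratified variant: preserve the class proportions in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, random_state=1, stratify=Y)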
In [396]:
X_train
Out[396]:
buying maint doors persons lug_boot safety
142 vhigh high 3 2 big med
1026 med high 4 2 small low
537 high vhigh 5more more big low
1298 low vhigh 2 2 small high
1296 low vhigh 2 2 small low
... ... ... ... ... ... ...
715 high med 4 4 med med
905 med vhigh 3 4 med high
1096 med med 2 4 big med
235 vhigh med 2 more small med
1061 med high 5more 2 big high

1296 rows × 6 columns

Preparing the data

In [397]:
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
Train (1296, 6) (1296, 1)
Test (432, 6) (432, 1)
In [398]:
#Verifying order of columns
#cat = ["buying","maint","doors","persons","lug_boot","safety"]
print(X_train.columns)
print(X_test.columns)
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')
In [399]:
from sklearn.preprocessing import OrdinalEncoder

#https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py

# prepare input data
def prepare_inputs(X_train, X_test): 
    oe = OrdinalEncoder(categories = [['low','med','high', 'vhigh' ],
                         ['low','med','high', 'vhigh' ],
                         ['2', '3', '4', '5more'],
                         ['2', '4', 'more'],
                         ['small', 'med', 'big'],
                         ['low', 'med', 'high']]
                       )
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train) 
    X_test_enc = oe.transform(X_test) 
    
    return X_train_enc, X_test_enc
In [400]:
from sklearn.preprocessing import LabelEncoder

# prepare input data
def prepare_targets(y_train, y_test): 
    le = LabelEncoder()
    le.fit(np.ravel(y_train))
    y_train_enc = le.transform(np.ravel(y_train)) 
    y_test_enc = le.transform(np.ravel(y_test)) 
    
    return y_train_enc, y_test_enc
In [401]:
from sklearn.preprocessing import OrdinalEncoder
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
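
LabelEncoder assigns integer codes in sorted label order, so the codes appearing in the classification reports below are 0 = acc, 1 = good, 2 = unacc, 3 = vgood. A quick sketch to inspect the mapping (fitting a throwaway encoder, since prepare_targets does not return the fitted one):

# The fitted classes_, listed in code order 0..3
le = LabelEncoder().fit(np.ravel(y_train))
print(le.classes_)   # ['acc' 'good' 'unacc' 'vgood'] -> codes 0, 1, 2, 3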

Feature Selection (Optional)

A comparison of the three scenarios below does not justify applying feature selection methods :

  • Accuracy when all 6 features are considered : 81.02%
  • Accuracy when the 4 features selected through "information gain" (mutual information) are considered : 81.25%
  • Accuracy when the 4 features selected through "Chi Squared" are considered : 81.25%

Note that we have applied a Logistic Regression classifier in all three scenarios to check for any significant movement in the accuracy scores.

In [402]:
# fit the model using all the features
model = LogisticRegression(solver='lbfgs', max_iter = 1000)
model.fit(X_train_enc, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_enc)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.02

Example of mutual information feature selection for categorical data

In [403]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
In [404]:
# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k=4) 
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
In [405]:
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
In [406]:
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
Feature 0: 0.053196
Feature 1: 0.022979
Feature 2: 0.000000
Feature 3: 0.146334
Feature 4: 0.005870
Feature 5: 0.179178
In [407]:
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_) 
plt.show()
In [408]:
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.25

Example of chi-squared feature selection for categorical data

In [409]:
from sklearn.feature_selection import chi2
# feature selection
def select_features2(X_train, y_train, X_test): 
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
In [410]:
# feature selection
X_train_fs2, X_test_fs2, fs2 = select_features2(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs2.scores_)):
    print('Feature %d: %f' % (i, fs2.scores_[i]))

# plot the scores
plt.bar([i for i in range(len(fs2.scores_))], fs2.scores_) 
plt.show()
Feature 0: 107.312113
Feature 1: 71.158014
Feature 2: 5.761583
Feature 3: 136.535729
Feature 4: 27.666582
Feature 5: 197.004190
In [411]:
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs2, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs2)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.25
The following classifiers will be trained and compared next :

  • Logistic Regression
  • k-Nearest Neighbors
  • Decision Trees
  • Naive Bayes
  • Random Forest
  • Gradient Boosting
  • Linear SVC

For each model, a learning curve is also plotted. A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

Useful link : https://scikit-learn.org/stable/modules/learning_curve.html

In [412]:
# Sample code for calculating time

st_time = time.time()

#code.......
#code.......

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
Total time: 0.00s
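
The same pattern can be wrapped in a context manager so the boilerplate is not repeated in every cell (a small convenience sketch):

from contextlib import contextmanager

@contextmanager
def timer():
    # Print the wall-clock time taken by the enclosed block
    st = time.time()
    yield
    print('Total time: {:.2f}s'.format(time.time() - st))

with timer():
    pass  # code to be timed goes here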
In [413]:
# Refer to section 8 for the evaluation metrics

def evaluation_parametrics(y_train,yp_train,y_test,yp_test):
  print("--------------------------------------------------------------------------")
  print("Classification Report for Train Data")
  print(classification_report(y_train, yp_train))
  print("Classification Report for Test Data")
  print(classification_report(y_test, yp_test))
  print("--------------------------------------------------------------------------")
  # Accuracy
  print("Accuracy on Train Data is: {}".format(round(accuracy_score(y_train,yp_train),2)))
  print("Accuracy on Test Data is: {}".format(round(accuracy_score(y_test,yp_test),2)))
  print("--------------------------------------------------------------------------")
  # Precision
  print("Precision on Train Data is: {}".format(round(precision_score(y_train,yp_train,average = "weighted"),2)))
  print("Precision on Test Data is: {}".format(round(precision_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
  # Recall 
  print("Recall on Train Data is: {}".format(round(recall_score(y_train,yp_train,average = "weighted"),2)))
  print("Recall on Test Data is: {}".format(round(recall_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
  # F1 Score
  print("F1 Score on Train Data is: {}".format(round(f1_score(y_train,yp_train,average = "weighted"),2)))
  print("F1 Score on Test Data is: {}".format(round(f1_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
In [414]:
# Creating custom method to populate a dictionary object with evaluation scores of the classifiers 

def create_dict(model, modelname, y_train, yp_train, y_test, yp_test):
    dict1 = {modelname :  {"F1" : {"Train": float(np.round(f1_score(y_train,yp_train,average = "weighted"),2)),
                                  "Test": float(np.round(f1_score(y_test,yp_test,average = "weighted"),2))},
                            "Recall": {"Train": float(np.round(recall_score(y_train,yp_train,average = "weighted"),2)),
                                       "Test": float(np.round(recall_score(y_test,yp_test,average = "weighted"),2))},
                            "Precision" :{"Train": float(np.round(precision_score(y_train,yp_train,average = "weighted"),2)),
                                        "Test": float(np.round(precision_score(y_test,yp_test,average = "weighted"),2))
                                       }}
                          
            }
    return dict1

dict = {}   # note: this name shadows the built-in dict; kept for consistency with the cells below
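
Once populated, the nested {model: {metric: {split: score}}} dictionary can be flattened into a comparison table; one possible sketch:

# One column block per model, metrics as sub-columns, Train/Test as rows
scores_df = pd.concat({name: pd.DataFrame(metrics)
                       for name, metrics in dict.items()}, axis=1)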
In [415]:
def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

Logistic Regression Classifier

In [416]:
lr = LogisticRegression(max_iter = 1000,random_state = 48, multi_class = 'multinomial')

st_time = time.time()
lr.fit(X_train_enc,y_train_enc)

yp_train_enc = lr.predict(X_train_enc)
yp_test_enc = lr.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(lr,X_test_enc, y_test_enc)

dict1 = create_dict(lr, "Logistic Regression Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(lr, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Logistic Regression Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.06s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.68      0.63      0.66       296
           1       0.60      0.41      0.49        51
           2       0.89      0.93      0.91       900
           3       0.80      0.71      0.75        49

    accuracy                           0.83      1296
   macro avg       0.74      0.67      0.70      1296
weighted avg       0.83      0.83      0.83      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.60      0.53      0.57        88
           1       0.42      0.28      0.33        18
           2       0.88      0.93      0.90       310
           3       0.79      0.69      0.73        16

    accuracy                           0.81       432
   macro avg       0.67      0.61      0.63       432
weighted avg       0.80      0.81      0.80       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.83
Accuracy on Test Data is: 0.81
--------------------------------------------------------------------------
Precision on Train Data is: 0.83
Precision on Test Data is: 0.8
--------------------------------------------------------------------------
Recall on Train Data is: 0.83
Recall on Test Data is: 0.81
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.83
F1 Score on Test Data is: 0.8
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[416]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>
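
The FutureWarning above names the replacement API; an equivalent call without the deprecation warning (a sketch):

from sklearn.metrics import ConfusionMatrixDisplay

# Same confusion-matrix plot via the non-deprecated class method
ConfusionMatrixDisplay.from_estimator(lr, X_test_enc, y_test_enc)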

Decision Tree Classifier

In [417]:
dt = DecisionTreeClassifier(max_depth = 7,random_state = 48) # Keeping max_depth = 7 to avoid overfitting
dt.fit(X_train_enc,y_train_enc)

yp_train_enc = dt.predict(X_train_enc)
yp_test_enc = dt.predict(X_test_enc)

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(dt,X_test_enc, y_test_enc)

dict1 = create_dict(dt, "Decision Tree Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(dt, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Decision Tree Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.84      0.96      0.89       296
           1       0.83      0.59      0.69        51
           2       0.99      0.96      0.98       900
           3       0.79      0.78      0.78        49

    accuracy                           0.94      1296
   macro avg       0.86      0.82      0.84      1296
weighted avg       0.94      0.94      0.94      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.78      0.94      0.85        88
           1       0.83      0.56      0.67        18
           2       0.99      0.95      0.97       310
           3       0.88      0.88      0.88        16

    accuracy                           0.93       432
   macro avg       0.87      0.83      0.84       432
weighted avg       0.94      0.93      0.93       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.94
Accuracy on Test Data is: 0.93
--------------------------------------------------------------------------
Precision on Train Data is: 0.94
Precision on Test Data is: 0.94
--------------------------------------------------------------------------
Recall on Train Data is: 0.94
Recall on Test Data is: 0.93
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.94
F1 Score on Test Data is: 0.93
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[417]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>
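
The value max_depth = 7 was fixed by hand to limit overfitting; a hedged sketch of validating that choice with a small grid search over depths (5-fold CV on the training split, weighted F1 as the criterion):

from sklearn.model_selection import GridSearchCV

# Search a small range of depths instead of hand-picking one
grid = GridSearchCV(DecisionTreeClassifier(random_state=48),
                    param_grid={'max_depth': range(3, 13)},
                    scoring='f1_weighted', cv=5)
grid.fit(X_train_enc, y_train_enc)
print(grid.best_params_, round(grid.best_score_, 3))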

K Nearest Neighbors Classifier

In [418]:
# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 7)

st_time = time.time()

knn.fit(X_train_enc,y_train_enc)

yp_train_enc = knn.predict(X_train_enc)
yp_test_enc = knn.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(knn,X_test_enc, y_test_enc)

dict1 = create_dict(knn, "K Nearest Neighbor Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(knn, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Knn Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.08s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       296
           1       0.96      0.84      0.90        51
           2       0.99      0.99      0.99       900
           3       1.00      0.90      0.95        49

    accuracy                           0.98      1296
   macro avg       0.97      0.93      0.95      1296
weighted avg       0.98      0.98      0.98      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.86      0.94      0.90        88
           1       0.92      0.67      0.77        18
           2       0.98      0.98      0.98       310
           3       1.00      0.81      0.90        16

    accuracy                           0.95       432
   macro avg       0.94      0.85      0.89       432
weighted avg       0.95      0.95      0.95       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.98
Accuracy on Test Data is: 0.95
--------------------------------------------------------------------------
Precision on Train Data is: 0.98
Precision on Test Data is: 0.95
--------------------------------------------------------------------------
Recall on Train Data is: 0.98
Recall on Test Data is: 0.95
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.98
F1 Score on Test Data is: 0.95
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[418]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>

Naive Bayes Classifier

In [419]:
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

st_time = time.time()

gnb.fit(X_train_enc,y_train_enc)

yp_train_enc = gnb.predict(X_train_enc)
yp_test_enc = gnb.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(gnb,X_test_enc, y_test_enc)

dict1 = create_dict(gnb, "Naive Bayes Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(gnb, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Naive Bayes Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.00s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.67      0.23      0.35       296
           1       0.55      0.24      0.33        51
           2       0.87      0.87      0.87       900
           3       0.18      1.00      0.30        49

    accuracy                           0.70      1296
   macro avg       0.57      0.58      0.46      1296
weighted avg       0.78      0.70      0.71      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.58      0.20      0.30        88
           1       0.50      0.17      0.25        18
           2       0.87      0.87      0.87       310
           3       0.19      1.00      0.32        16

    accuracy                           0.71       432
   macro avg       0.54      0.56      0.44       432
weighted avg       0.77      0.71      0.71       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.7
Accuracy on Test Data is: 0.71
--------------------------------------------------------------------------
Precision on Train Data is: 0.78
Precision on Test Data is: 0.77
--------------------------------------------------------------------------
Recall on Train Data is: 0.7
Recall on Test Data is: 0.71
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.71
F1 Score on Test Data is: 0.71
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Naive Bayes classifier]
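
As an aside: GaussianNB models every feature as a continuous Gaussian, while all six attributes here are ordinal-encoded categories, which helps explain the weak scores above. Below is a minimal editorial sketch (not part of the original run) using sklearn's CategoricalNB, which models per-category frequencies directly; it assumes the encoded splits hold non-negative integer codes.

In [ ]:
# Sketch: CategoricalNB treats each column as a discrete categorical feature,
# matching this dataset better than GaussianNB's continuous-Gaussian assumption.
from sklearn.naive_bayes import CategoricalNB

cnb = CategoricalNB()
cnb.fit(X_train_enc, y_train_enc)   # assumes non-negative integer-encoded features

print("Train F1 (weighted): {:.2f}".format(
    f1_score(y_train_enc, cnb.predict(X_train_enc), average='weighted')))
print("Test F1 (weighted): {:.2f}".format(
    f1_score(y_test_enc, cnb.predict(X_test_enc), average='weighted')))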

Random Forest Classifier

In [420]:
rf = RandomForestClassifier(max_depth = 7, random_state = 48) # keeping max_depth = 7, the same as the decision tree

st_time = time.time()

rf.fit(X_train_enc,y_train_enc)

yp_train_enc = rf.predict(X_train_enc)
yp_test_enc = rf.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(rf, X_test_enc, y_test_enc)

dict1 = create_dict(rf, "Random Forest Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(rf, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Random Forest Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.17s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       296
           1       0.93      0.84      0.89        51
           2       1.00      0.99      0.99       900
           3       0.93      0.76      0.83        49

    accuracy                           0.97      1296
   macro avg       0.94      0.89      0.92      1296
weighted avg       0.98      0.97      0.97      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.81      0.93      0.87        88
           1       0.86      0.67      0.75        18
           2       0.99      0.96      0.97       310
           3       0.93      0.88      0.90        16

    accuracy                           0.94       432
   macro avg       0.90      0.86      0.87       432
weighted avg       0.94      0.94      0.94       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.97
Accuracy on Test Data is: 0.94
--------------------------------------------------------------------------
Precision on Train Data is: 0.98
Precision on Test Data is: 0.94
--------------------------------------------------------------------------
Recall on Train Data is: 0.97
Recall on Test Data is: 0.94
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.97
F1 Score on Test Data is: 0.94
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Random Forest classifier]
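
Note that max_depth = 7 was carried over from the decision tree rather than tuned for the forest. A small grid search is one way to check that choice; the sketch below is an editorial addition, and the grid values are illustrative, not from the original run.

In [ ]:
# Sketch: tune depth and forest size with 5-fold CV,
# scoring by macro F1 so the rare classes count equally.
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],
              'max_depth': [7, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=48),
                      param_grid, scoring='f1_macro', cv=5)
search.fit(X_train_enc, y_train_enc)
print(search.best_params_, round(search.best_score_, 3))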

Linear SVC Classifier

In [421]:
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)

st_time = time.time()
svm.fit(X_train_enc,y_train_enc)

yp_train_enc = svm.predict(X_train_enc)
yp_test_enc = svm.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(svm, X_test_enc, y_test_enc)

dict1 = create_dict(svm, "Linear SVC", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(svm, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Linear SVC Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.01s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.78      0.66      0.71       296
           1       0.48      0.94      0.63        51
           2       0.91      0.89      0.90       900
           3       0.61      0.76      0.67        49

    accuracy                           0.84      1296
   macro avg       0.69      0.81      0.73      1296
weighted avg       0.85      0.84      0.84      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.74      0.64      0.68        88
           1       0.50      1.00      0.67        18
           2       0.90      0.88      0.89       310
           3       0.59      0.62      0.61        16

    accuracy                           0.83       432
   macro avg       0.68      0.79      0.71       432
weighted avg       0.84      0.83      0.83       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.84
Accuracy on Test Data is: 0.83
--------------------------------------------------------------------------
Precision on Train Data is: 0.85
Precision on Test Data is: 0.84
--------------------------------------------------------------------------
Recall on Train Data is: 0.84
Recall on Test Data is: 0.83
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.84
F1 Score on Test Data is: 0.83
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Linear SVC classifier]
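
One caveat: a linear model sees the ordinal codes as points on a line, so it can only separate classes with thresholds along that imposed ordering. One-hot encoding usually suits linear models better. A hedged editorial sketch follows, assuming X_train and X_test (names assumed here) still hold the raw categorical columns from before encoding.

In [ ]:
# Sketch: one-hot encode the categories before the linear SVC.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

ohe_svm = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),       # one binary column per category
    LinearSVC(class_weight='balanced', C=0.1, max_iter=10000))
ohe_svm.fit(X_train, y_train_enc)                 # X_train: raw categorical frame (assumed name)
print('Test accuracy: {:.2f}'.format(ohe_svm.score(X_test, y_test_enc)))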

Gradient Boosting

In [422]:
gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=10)

st_time = time.time()
gb_model.fit(X_train_enc,y_train_enc)

yp_train_enc = gb_model.predict(X_train_enc)
yp_test_enc = gb_model.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(gb_model, X_test_enc, y_test_enc)  # was mistakenly plotting svm's confusion matrix

dict1 = create_dict(gb_model, "Gradient Boosting", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(gb_model, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Gradient Boosting Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.96s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       296
           1       1.00      1.00      1.00        51
           2       1.00      1.00      1.00       900
           3       1.00      1.00      1.00        49

    accuracy                           1.00      1296
   macro avg       1.00      1.00      1.00      1296
weighted avg       1.00      1.00      1.00      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.93      0.98      0.96        88
           1       0.93      0.78      0.85        18
           2       0.98      0.98      0.98       310
           3       1.00      0.94      0.97        16

    accuracy                           0.97       432
   macro avg       0.96      0.92      0.94       432
weighted avg       0.97      0.97      0.97       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 1.0
Accuracy on Test Data is: 0.97
--------------------------------------------------------------------------
Precision on Train Data is: 1.0
Precision on Test Data is: 0.97
--------------------------------------------------------------------------
Recall on Train Data is: 1.0
Recall on Test Data is: 0.97
--------------------------------------------------------------------------
F1 Score on Train Data is: 1.0
F1 Score on Test Data is: 0.97
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Gradient Boosting classifier]
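
The perfect 1.00 train scores with max_depth = 10 suggest the booster has memorized the training set, even though it still generalizes well here. Shallower trees plus early stopping are the usual remedies; the sketch below is an editorial addition with illustrative parameter values, not from the original run.

In [ ]:
# Sketch: shallower trees and early stopping to curb overfitting.
gb_reg = GradientBoostingClassifier(
    n_estimators=500,
    max_depth=3,              # typical depth for boosted trees
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=10,      # stop once the validation score stalls
    random_state=48)
gb_reg.fit(X_train_enc, y_train_enc)
print('Boosting rounds used:', gb_reg.n_estimators_)
print('Test accuracy: {:.2f}'.format(gb_reg.score(X_test_enc, y_test_enc)))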

Listing the performance from all the models

In [424]:
# Retrieving the performance scores from the dict object
pd.DataFrame.from_dict({(i,j): dict[i][j] 
                           for i in dict.keys() 
                           for j in dict[i].keys()},
                       orient='index')
Out[424]:
                                            Train  Test
Logistic Regression Classifier  F1           0.83  0.80
                                Recall       0.83  0.81
                                Precision    0.83  0.80
Decision Tree Classifier        F1           0.94  0.93
                                Recall       0.94  0.93
...                             ...           ...   ...
Linear SVC                      Recall       0.84  0.83
                                Precision    0.85  0.84
Gradient Boosting               F1           1.00  0.97
                                Recall       1.00  0.97
                                Precision    1.00  0.97

21 rows × 2 columns

In [425]:
# Retrieving the performance scores from the dict object
# Transposing rows and headers relative to the previous cell
# Tabulating the scores for the different classifiers

model_names = []
frames = []

for model_name, d in dict.items():
    model_names.append(model_name)
    frames.append(pd.DataFrame.from_dict(d, orient='columns'))

df = pd.concat(frames, keys=model_names)
df.unstack(level = -1).style.background_gradient(cmap='Blues')
Out[425]:
                                   F1            Recall         Precision
                                Train  Test   Train  Test    Train  Test
Logistic Regression Classifier   0.83  0.80    0.83  0.81     0.83  0.80
Decision Tree Classifier         0.94  0.93    0.94  0.93     0.94  0.94
K Nearest Neighbor Classifier    0.98  0.95    0.98  0.95     0.98  0.95
Naive Bayes Classifier           0.71  0.71    0.70  0.71     0.78  0.77
Random Forest Classifier         0.97  0.94    0.97  0.94     0.98  0.94
Linear SVC                       0.84  0.83    0.84  0.83     0.85  0.84
Gradient Boosting                1.00  0.97    1.00  0.97     1.00  0.97
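
All of the figures above are weighted averages, which the 70% majority class dominates. Macro averages treat the four classes equally and so expose minority-class weakness more clearly; a short editorial sketch over the models fitted in this section:

In [ ]:
# Sketch: macro F1 penalizes models that only do well on the dominant class.
for name, model in [("Naive Bayes", gnb), ("Random Forest", rf),
                    ("Linear SVC", svm), ("Gradient Boosting", gb_model)]:
    macro = f1_score(y_test_enc, model.predict(X_test_enc), average='macro')
    print("{:20s} macro F1 (test): {:.2f}".format(name, macro))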

Conclusion

  • The Decision Tree, K Nearest Neighbor, Random Forest and Gradient Boosting classifiers all scored high on F1, Recall and Precision for both the train and test sets.
  • Gradient Boosting gave the best overall performance. However, this comes at the cost of interpretability and explainability.
  • Logistic Regression and Linear SVC, followed by Naive Bayes, scored lowest on the evaluation metrics.
  • Because the class distribution is skewed, the weighted averages above can hide weak minority-class performance; the macro averages in the per-class reports (e.g. macro F1 of 0.46 versus weighted F1 of 0.71 on train for Naive Bayes) give a more conservative picture, as the macro F1 sketch above illustrates.

Resources to follow for imbalanced learning

Datasets

  • Breast Cancer Wisconsin dataset
  • Credit Card Fraud detection Kaggle dataset

Real-life scenarios with data imbalance

The table below is adapted from "Learning from imbalanced data: open challenges and future directions" by Bartosz Krawczyk.

Application Area               Problem Description
Activity Recognition           Detection of rare or less-frequent activities (multi-class problem)
Behavior Analysis              Recognition of dangerous behavior (binary problem)
Cancer Malignancy Grading      Analyzing the cancer severity (binary and multi-class problem)
Hyperspectral Data Analysis    Classification of varying areas in multi-dimensional images (multi-class problem)
Industrial Systems Monitoring  Fault detection in industrial machinery (binary problem)
Sentiment Analysis             Emotion and temper recognition in text (binary and multi-class problem)
Software Defect Prediction     Recognition of errors in code blocks (binary problem)
Target Detection               Classification of specified targets appearing with varied frequency (multi-class problem)
Text Mining                    Detecting relations in literature (binary problem)
Video Mining                   Recognizing objects and actions in video sequences (binary and multi-class problem)