Car Evaluation

This dataset was downloaded from the UC Irvine Machine Learning Repository.

The dataset concerns the evaluation of cars.
The target variable/label is car acceptability and has four categories : unacceptable, acceptable, good and very good.

The input attributes fall under two broad categories - Price and Technical Characteristics.

  • Under Price, the attributes are buying price and maintenance price.
  • Under Technical characteristics, the attributes are doors, persons, size of luggage boot and safety.

We have identified that this is an imbalanced dataset with skewed class (output category/label) proportions.

The objective here is to build a multiclass classifier based on the input attributes.

Summary of Key information

Number of Instances/training examples           : 1728  
Number of Instances with missing attributes     :    0  
Number of qualified Instances/training examples : 1728

Number of Input Attributes                     :  6
Number of categorical attributes               :  6
Number of numerical attributes                 :  0

Target Attribute Type                          : Multi class label
Target Class distribution                      : 70.0% : 22.2% : 4.0% : 3.8%
Problem Identification                         : Multiclass Classification with imbalanced data set

Importing the necessary libraries

In [350]:
# Data Wrangling, inspection 
import numpy as np
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt

# Data preprocessing 
import category_encoders as ce 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OrdinalEncoder

# sklearn ml models
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC

# Evaluation metrics
from sklearn.metrics import recall_score, precision_score, \
accuracy_score, plot_confusion_matrix, classification_report, f1_score
# Note: plot_confusion_matrix is deprecated since sklearn 1.0
# (see the ConfusionMatrixDisplay sketch in the Logistic Regression section)

Loading the data set

In [351]:
pathname = "/Users/bhaskarroy/BHASKAR FILES/BHASKAR CAREER/Career/Skills/Data Science/Practise/Python/UCI Machine Learning Repository/car"
path0 = "/car.c45-names"
path1 = "/car.data"
path2 = "/car.names"


pathdata = pathname + path1
pathcolname = pathname + path0
pathdatadesc = pathname + path2
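
The same paths can be built with pathlib, which composes paths with the / operator (an equivalent alternative sketch; open() and pd.read_csv accept Path objects):

from pathlib import Path

base = Path(pathname)                  # the directory defined above
pathdata = base / "car.data"           # raw instances
pathcolname = base / "car.c45-names"   # C4.5 names file
pathdatadesc = base / "car.names"      # dataset description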

Dataset Information

In [352]:
with open(pathdatadesc) as f:
    print(f.read())
1. Title: Car Evaluation Database

2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec   (marko.bohanec@ijs.si)
               Blaz Zupan      (blaz.zupan@ijs.si)
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

   Input attributes are printed in lowercase. Besides the target
   concept (CAR), the model includes three intermediate concepts:
   PRICE, TECH, COMFORT. Every concept is in the original model
   related to its lower level descendants by a set of examples (for
   these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).

   The Car Evaluation Database contains examples with the structural
   information removed, i.e., directly relates CAR to the six input
   attributes: buying, maint, doors, persons, lug_boot, safety.

   Because of known underlying concept structure, this database may be
   particularly useful for testing constructive induction and
   structure discovery methods.

5. Number of Instances: 1728
   (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %) 

Attribute Information

In [353]:
with open(pathcolname) as f:
    print(f.read())
| names file (C4.5 format) for car evaluation domain

| class values

unacc, acc, good, vgood

| attributes

buying:   vhigh, high, med, low.
maint:    vhigh, high, med, low.
doors:    2, 3, 4, 5more.
persons:  2, 4, more.
lug_boot: small, med, big.
safety:   low, med, high.

Data Preprocessing

We will prepare the data for :

  • Exploratory Data analysis (EDA) and
  • for model building

The following actions were undertaken :

  • Converting to Dataframe Format
  • Inspecting whether any missing values are present
  • Handling Missing values : there are no missing values. Hence, the entire dataset can be considered for model building.
  • Processing Categorical Attributes : categorical attributes have been converted to the categorical data type for EDA.
  • Processing Continuous Attributes : not applicable, as both the input and output attributes are categorical.

Converting to Dataframe Format

In [354]:
colnames = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data = pd.read_csv(pathdata, names = colnames, index_col = False)
data
Out[354]:
buying maint doors persons lug_boot safety class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
... ... ... ... ... ... ... ...
1723 low low 5more more med med good
1724 low low 5more more med high vgood
1725 low low 5more more big low unacc
1726 low low 5more more big med good
1727 low low 5more more big high vgood

1728 rows × 7 columns

In [355]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
In [356]:
data.describe()
Out[356]:
buying maint doors persons lug_boot safety class
count 1728 1728 1728 1728 1728 1728 1728
unique 4 4 4 3 3 3 4
top low low 5more 4 big low unacc
freq 432 432 432 576 576 576 1210
In [357]:
data.dtypes
Out[357]:
buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
class       object
dtype: object
In [358]:
# Inspect if any missing values present
data.isnull().sum()
Out[358]:
buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

Converting columns to categorical data types

In [359]:
for i in data.columns :
    print(f'{i} : {data[i].unique().tolist()}')
buying : ['vhigh', 'high', 'med', 'low']
maint : ['vhigh', 'high', 'med', 'low']
doors : ['2', '3', '4', '5more']
persons : ['2', '4', 'more']
lug_boot : ['small', 'med', 'big']
safety : ['low', 'med', 'high']
class : ['unacc', 'acc', 'vgood', 'good']
In [360]:
cat_vars = data.columns
In [361]:
catdict = {
    "buying":   ['low', 'med', 'high', 'vhigh'],
    "maint":    ['low', 'med', 'high', 'vhigh'],
    "doors":    ['2', '3', '4', '5more'],
    "persons":  ['2', '4', 'more'],
    "lug_boot": ['small', 'med', 'big'],
    "safety":   ['low', 'med', 'high'],
    "class":    ['unacc', 'acc', 'good', 'vgood']  # the data file uses 'vgood', not 'v-good'
}
In [362]:
for i in cat_vars :
    data[i] = pd.Categorical(data[i], 
                             categories=catdict[i], ordered=True)
In [363]:
data.dtypes
Out[363]:
buying      category
maint       category
doors       category
persons     category
lug_boot    category
safety      category
class       category
dtype: object
In [364]:
def show(data):
  for i in data.columns:
    print("Feature: {} with {} Levels".format(i, data[i].unique()))

show(data)
Feature: buying with ['vhigh', 'high', 'med', 'low']
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh'] Levels
Feature: maint with ['vhigh', 'high', 'med', 'low']
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh'] Levels
Feature: doors with ['2', '3', '4', '5more']
Categories (4, object): ['2' < '3' < '4' < '5more'] Levels
Feature: persons with ['2', '4', 'more']
Categories (3, object): ['2' < '4' < 'more'] Levels
Feature: lug_boot with ['small', 'med', 'big']
Categories (3, object): ['small' < 'med' < 'big'] Levels
Feature: safety with ['low', 'med', 'high']
Categories (3, object): ['low' < 'med' < 'high'] Levels
Feature: class with ['unacc', 'acc', 'vgood', 'good']
Categories (4, object): ['unacc' < 'acc' < 'good' < 'vgood'] Levels
In [365]:
data.isnull().sum()
Out[365]:
buying       0
maint        0
doors        0
persons      0
lug_boot     0
safety       0
class        0
dtype: int64

Univariate Analysis : Categorical Variables

In [366]:
#Accessing colors from external library Palettable
from palettable.cartocolors.qualitative import Bold_10 
colors = Bold_10.mpl_colors

#colors = plt.cm.Dark2(range(15))
#colors = plt.cm.tab20(range(15))

#.colors attribute for listed colormaps
#colors = plt.cm.tab10.colors 
#colors = plt.cm.Paired.colors
In [367]:
# custom function for easy and efficient analysis of categorical univariate
def UVA_category(data_frame, var_group = [], **kargs):

  '''
Stands for Univariate_Analysis_categorical.
  Takes a group of categorical variables and, for each, prints the value_counts and plots a horizontal barplot.

- data_frame : The Dataframe
  - var_group : The list of column names for which univariate plots are to be plotted

  The keyword arguments are as follows :
- colcount : The number of columns in the plot layout. Default value is 2.
  For instance, if there are 4 columns in var_group, the 4 univariate plots will be laid out in a 2x2 grid.
  - colwidth : width of each plot
  - rowheight : height of each plot
  - normalize : Whether to present absolute values or percentage
  - sort_by : Whether to sort the bars by descending order of values

- axlabel_fntsize : fontsize of the x-axis and y-axis labels
  - axticklabel_fntsize : fontsize of the axis tick labels
  - infofntsize : fontsize of the info of unique value counts
  - infofntfamily : Font family of info of unique value counts.
  Choose font family belonging to Monospace for multiline alignment.
  Some choices are : 'Consolas', 'Courier','Courier New', 'Lucida Sans Typewriter','Lucidatypewriter','Andale Mono'
  https://www.tutorialbrain.com/css_tutorial/css_font_family_list/
  - max_val_counts : Number of unique values for which count should be displayed
  - nspaces : Length of each line for the multiline strings in the info area for value_counts
  - ncountspaces : Length allocated to the count value for the unique values in the info area
  - show_percentage : Whether to show percentage of total for each unique value count
  Also check link for formatting syntax : https://pyformat.info/#number
  '''

  import textwrap
  data = data_frame.copy(deep = True)
  # Using dictionaries with default values of keyword arguments
  params_plot = dict(colcount = 2, colwidth = 7, rowheight = 4, normalize = False, sort_by = "Values")
  params_fontsize =  dict(axlabel_fntsize = 10,axticklabel_fntsize = 8, infofntsize = 10)
  params_fontfamily = dict(infofntfamily = 'Andale Mono')
  params_max_val_counts = dict(max_val_counts = 10)
  params_infospaces = dict(nspaces = 10, ncountspaces = 4)
  params_show_percentage = dict(show_percentage = True)



  # Updating the dictionary with parameter values passed while calling the function
  params_plot.update((k, v) for k, v in kargs.items() if k in params_plot)
  params_fontsize.update((k, v) for k, v in kargs.items() if k in params_fontsize)
  params_fontfamily.update((k, v) for k, v in kargs.items() if k in params_fontfamily)
  params_max_val_counts.update((k, v) for k, v in kargs.items() if k in params_max_val_counts)
  params_infospaces.update((k, v) for k, v in kargs.items() if k in params_infospaces)
  params_show_percentage.update((k, v) for k, v in kargs.items() if k in params_show_percentage)

  #params = dict(**params_plot, **params_fontsize)

  # Initialising all the possible keyword arguments of doc string with updated values
  colcount = params_plot['colcount']
  colwidth = params_plot['colwidth']
  rowheight = params_plot['rowheight']
  normalize = params_plot['normalize']
  sort_by = params_plot['sort_by']

  axlabel_fntsize = params_fontsize['axlabel_fntsize']
  axticklabel_fntsize = params_fontsize['axticklabel_fntsize']
  infofntsize = params_fontsize['infofntsize']
  infofntfamily = params_fontfamily['infofntfamily']
  max_val_counts =  params_max_val_counts['max_val_counts']
  nspaces = params_infospaces['nspaces']
  ncountspaces = params_infospaces['ncountspaces']
  show_percentage = params_show_percentage['show_percentage']

  if len(var_group) == 0:
        var_group = data.select_dtypes(exclude = ['number']).columns.to_list()

  import matplotlib.pyplot as plt
  plt.rcdefaults()
  # setting figure_size
  size = len(var_group)
  #rowcount = 1
  #colcount = size//rowcount+(size%rowcount != 0)*1


  rowcount = size//colcount + (size%colcount != 0)*1

  plt.figure(figsize = (colwidth*colcount,rowheight*rowcount), dpi = 150)


  # Converting the filtered columns as categorical
  for i in var_group:
        #data[i] = data[i].astype('category')
        data[i] = pd.Categorical(data[i])


  # for every variable
  for j,i in enumerate(var_group):
    #print('{} : {}'.format(j,i))
    norm_count = data[i].value_counts(normalize = normalize).sort_index()
    n_uni = data[i].nunique()

    if sort_by == "Values":
        norm_count = data[i].value_counts(normalize = normalize).sort_values(ascending = False)
        n_uni = data[i].nunique()


  #Plotting the variable with every information
    plt.subplot(rowcount,colcount,j+1)
    sns.barplot(x = norm_count, y = norm_count.index , order = norm_count.index)

    if normalize == False :
        plt.xlabel('count', fontsize = axlabel_fntsize )
    else :
        plt.xlabel('fraction/percent', fontsize = axlabel_fntsize )
    plt.ylabel('{}'.format(i), fontsize = axlabel_fntsize )

    ax = plt.gca()

    # textwrapping
    ax.set_yticklabels([textwrap.fill(str(e), 20) for e in norm_count.index], fontsize = axticklabel_fntsize)

    #print(n_uni)
    #print(type(norm_count.round(2)))

    # Functions to convert the pairing of unique values and value_counts into text string
    # Function to break a word into multiline string of fixed width per line
    def paddingString(word, nspaces = 20):
        i = len(word)//nspaces \
            +(len(word)%nspaces > 0)*(len(word)//nspaces > 0)*1 \
            + (len(word)//nspaces == 0)*1
        strA = ""
        for j in range(i-1):
            strA = strA+'\n'*(len(strA)>0)+ word[j*nspaces:(j+1)*nspaces]

        # insert appropriate number of white spaces
        strA = strA + '\n'*(len(strA)>0)*(i>1)+word[(i-1)*nspaces:] \
               + " "*(nspaces-len(word)%nspaces)*(len(word)%nspaces > 0)
        return strA

    # Function to convert Pandas series into multi line strings
    def create_string_for_plot(ser, nspaces = nspaces, ncountspaces = ncountspaces, \
                              show_percentage =  show_percentage):
        '''
        - nspaces : Length of each line for the multiline strings in the info area for value_counts
        - ncountspaces : Length allocated to the count value for the unique values in the info area
        - show_percentage : Whether to show percentage of total for each unique value count
        Also check link for formatting syntax : https://pyformat.info/#number
        '''
        str_text = ""
        for index, value in ser.items():
            str_tmp = paddingString(str(index), nspaces)+ " : " \
                      + " "*(ncountspaces-len(str(value)))*(len(str(value))<= ncountspaces) \
                      + str(value) \
                      + (" | " + "{:4.1f}%".format(value/ser.sum()*100))*show_percentage


            str_text = str_text + '\n'*(len(str_text)>0) + str_tmp
        return str_text

    #print(create_string_for_plot(norm_count.round(2)))

    #Ensuring a maximum of 10 unique values displayed
    if norm_count.round(2).size <= max_val_counts:
        text = '{}\nn_uniques = {}\nvalue counts\n{}' \
                .format(i, n_uni,create_string_for_plot(norm_count.round(2)))
        ax.annotate(text = text,
                    xy = (1.1, 1), xycoords = ax.transAxes,
                    ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)
    else :
        text = '{}\nn_uniques = {}\nvalue counts of top 10\n{}' \
                .format(i, n_uni,create_string_for_plot(norm_count.round(2)[0:max_val_counts]))
        ax.annotate(text = text,
                    xy = (1.1, 1), xycoords = ax.transAxes,
                    ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)


    plt.gcf().tight_layout()
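
A quick usage sketch of the function defined above (the overview cells below call a packaged variant from the local eda module instead):

# Normalized value counts for two columns, stacked in one column of plots
UVA_category(data, ['class', 'safety'], colcount = 1, normalize = True)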

Overview of all the categorical variables

In [368]:
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
                          rowheight = 3, normalize = False);
Categorical features : Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object') 

In [369]:
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
                          rowheight = 3, normalize = True);
Categorical features : Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object') 

In [370]:
from eda import composite_plots
composite_plots.bar_counts(data, data.columns)

For a car to be acceptable, it has to be low on at least one of the pricing parameters - buying price or maintenance.

In [371]:
dv = "buying"
df = pd.crosstab([data[dv]],[data['class']])
df.head()
df.plot(kind='bar', stacked = True, title=dv )
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.ylabel('Class Frequency')
Out[371]:
Text(0, 0.5, 'Class Frequency')
In [372]:
data.columns
Out[372]:
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')
In [373]:
from eda.composite_plots import features_plots
features_plots(data, ['buying', 'maint'], 'class')
In [374]:
pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False).transpose()
Out[374]:
buying low med high vhigh
maint low med high vhigh low med high vhigh low med high vhigh low med high vhigh
class
unacc 62 62 62 72 62 62 72 72 72 72 72 108 72 72 108 108
acc 10 10 33 36 10 33 36 36 36 36 36 0 36 36 0 0
good 23 23 0 0 23 0 0 0 0 0 0 0 0 0 0 0
vgood 13 13 13 0 13 13 0 0 0 0 0 0 0 0 0 0
In [375]:
data.pivot_table(index=['buying','maint'], aggfunc='size')
Out[375]:
buying  maint
low     low      108
        med      108
        high     108
        vhigh    108
med     low      108
        med      108
        high     108
        vhigh    108
high    low      108
        med      108
        high     108
        vhigh    108
vhigh   low      108
        med      108
        high     108
        vhigh    108
dtype: int64
In [376]:
df3 = data.groupby(["buying", "maint"]).size().reset_index(name="value_count")
sns.barplot(x = 'buying', y = 'value_count', hue = 'maint', data = df3)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[376]:
<matplotlib.legend.Legend at 0x7fac94fdab50>

CAR : car acceptability

  • PRICE : overall price

    • buying : buying price
    • maint : price of the maintenance

  • TECH : technical characteristics

    • COMFORT : comfort

      • doors : number of doors
      • persons : capacity in terms of persons to carry
      • lug_boot : the size of luggage boot

    • safety : estimated safety of the car

In [377]:
# Random inspection - price vs class
pd.set_option('display.max_rows', 20)
df4 = pd.DataFrame({'count' : data.groupby(by = ['buying','maint','class']).size()}).reset_index()
df4.style.background_gradient(cmap='Blues')
#https://stackoverflow.com/questions/12286607/making-heatmap-from-pandas-dataframe
Out[377]:
buying maint class count
0 low low unacc 62
1 low low acc 10
2 low low good 23
3 low low vgood 13
4 low med unacc 62
5 low med acc 10
6 low med good 23
7 low med vgood 13
8 low high unacc 62
9 low high acc 33
10 low high good 0
11 low high vgood 13
12 low vhigh unacc 72
13 low vhigh acc 36
14 low vhigh good 0
15 low vhigh vgood 0
16 med low unacc 62
17 med low acc 10
18 med low good 23
19 med low vgood 13
20 med med unacc 62
21 med med acc 33
22 med med good 0
23 med med vgood 13
24 med high unacc 72
25 med high acc 36
26 med high good 0
27 med high vgood 0
28 med vhigh unacc 72
29 med vhigh acc 36
30 med vhigh good 0
31 med vhigh vgood 0
32 high low unacc 72
33 high low acc 36
34 high low good 0
35 high low vgood 0
36 high med unacc 72
37 high med acc 36
38 high med good 0
39 high med vgood 0
40 high high unacc 72
41 high high acc 36
42 high high good 0
43 high high vgood 0
44 high vhigh unacc 108
45 high vhigh acc 0
46 high vhigh good 0
47 high vhigh vgood 0
48 vhigh low unacc 72
49 vhigh low acc 36
50 vhigh low good 0
51 vhigh low vgood 0
52 vhigh med unacc 72
53 vhigh med acc 36
54 vhigh med good 0
55 vhigh med vgood 0
56 vhigh high unacc 108
57 vhigh high acc 0
58 vhigh high good 0
59 vhigh high vgood 0
60 vhigh vhigh unacc 108
61 vhigh vhigh acc 0
62 vhigh vhigh good 0
63 vhigh vhigh vgood 0
In [378]:
df4
Out[378]:
buying maint class count
0 low low unacc 62
1 low low acc 10
2 low low good 23
3 low low vgood 13
4 low med unacc 62
... ... ... ... ...
59 vhigh high vgood 0
60 vhigh vhigh unacc 108
61 vhigh vhigh acc 0
62 vhigh vhigh good 0
63 vhigh vhigh vgood 0

64 rows × 4 columns

In [379]:
#pd.DataFrame(data[['buying','maint','class']].value_counts())

Price vs Class heatmap

Price has two components - buying price and maintenance.

  • Data is equally distributed across the combinations of buying price (4 levels) and maintenance (4 levels).
  • Class instances of good and very good are concentrated where buying price and maintenance are low or medium.

Buying Price vs Class

In [380]:
pd.crosstab(index = [data['buying']], columns = data['class'], margins = False)
Out[380]:
class unacc acc good vgood
buying
low 258 89 46 39
med 268 115 23 26
high 324 108 0 0
vhigh 360 72 0 0
In [381]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, index = ['buying'], column = ['class'])

Maintenance vs Class

In [382]:
pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)
Out[382]:
class unacc acc good vgood
maint
low 268 92 46 26
med 268 115 23 26
high 314 105 0 13
vhigh 360 72 0 0
In [383]:
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)

df_MainVsClass = pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)

sns.heatmap(df_MainVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12) 
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
Out[383]:
Text(20.72222222222222, 0.5, 'maint')

Buying vs Maintenance

In [384]:
df_BuyingVsMaintenance = pd.crosstab(index = [data['buying']], columns = data['maint'], margins = False)

df_BuyingVsMaintenance
Out[384]:
maint low med high vhigh
buying
low 108 108 108 108
med 108 108 108 108
high 108 108 108 108
vhigh 108 108 108 108
In [385]:
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)
                       
sns.heatmap(df_BuyingVsMaintenance)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12) 
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
Out[385]:
Text(20.72222222222222, 0.5, 'buying')

Price (Buying Price and Maintenance) vs Class

In [386]:
df_PricingVsClass = pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False)

df_PricingVsClass
Out[386]:
class unacc acc good vgood
buying maint
low low 62 10 23 13
med 62 10 23 13
high 62 33 0 13
vhigh 72 36 0 0
med low 62 10 23 13
med 62 33 0 13
high 72 36 0 0
vhigh 72 36 0 0
high low 72 36 0 0
med 72 36 0 0
high 72 36 0 0
vhigh 108 0 0 0
vhigh low 72 36 0 0
med 72 36 0 0
high 108 0 0 0
vhigh 108 0 0 0
In [387]:
fig, ax = plt.subplots(figsize=(5,4))
#font_kwds = dict(fontsize = 12)
sns.heatmap(df_PricingVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 10) 
plt.xlabel(ax.get_xlabel(), fontsize = 8, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 8, fontweight = 'heavy')
Out[387]:
Text(33.222222222222214, 0.5, 'buying-maint')

Comfort vs Class Heatmaps

Comfort has three aspects - doors, luggage boot size and persons.

  • All two-person cars are classified as unacceptable.
  • No other specific pattern emerges from the EDA of Comfort vs Class.
In [388]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['lug_boot'], ['class'])
In [389]:
pd.crosstab([data['doors'], data['lug_boot']], [data['class']])
Out[389]:
class unacc acc good vgood
doors lug_boot
2 small 126 15 3 0
med 108 30 6 0
big 92 36 6 10
3 small 108 30 6 0
med 100 33 6 5
big 92 36 6 10
4 small 108 30 6 0
med 92 36 6 10
big 92 36 6 10
5more small 108 30 6 0
med 92 36 6 10
big 92 36 6 10
In [390]:
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['doors','lug_boot'], ['class'])
In [391]:
data.columns
Out[391]:
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

Defining the Problem

This is a multiclass classification problem.
The data is imbalanced, as the classes are not equally represented.
Accuracy is not the metric to use when working with an imbalanced dataset: it is misleading, because the majority class dominates the score (a small illustrative sketch follows the note below).
Instead, we can use the performance measures below :

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
  • Precision: A measure of a classifier's exactness.
  • Recall: A measure of a classifier's completeness.
  • F1 Score (or F-score): A weighted average of precision and recall.

Appropriateness of the performance measures :

  • Accuracy : appropriate for balanced datasets.
  • Precision : appropriate when minimising false positives is the focus.
  • Recall : appropriate when minimising false negatives is the focus.
  • F-measure : combines precision and recall into a single measure that captures both properties.

Note : It is common practice to measure the efficacy of binary classification models using the Area Under the Curve (AUC) of the ROC curve. Multiclass classification requires some tweaks to the ROC AUC approach; we are not pursuing that in this notebook. Read this for more info : https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

http://www.svds.com/learning-imbalanced-classes/
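
To make the accuracy pitfall concrete, here is a small illustrative sketch on synthetic labels (not the car data): a degenerate classifier that always predicts the majority class looks strong on accuracy, while macro recall and F1 expose its failure on the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic 90:10 imbalance; the "model" predicts the majority class everywhere
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                 # 0.90 -- looks good
print(recall_score(y_true, y_pred, average='macro'))  # 0.50 -- minority recall is 0
print(f1_score(y_true, y_pred, average='macro'))      # ~0.47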

In [392]:
from eda.axes_utils import Add_valuecountsinfo, Add_data_labels

ax = data['class'].value_counts().plot(kind = 'bar', figsize=(3,3))
ax.yaxis.grid(True, alpha = 0.3)
ax.set(axisbelow = True)
Add_data_labels(ax.patches)
Add_valuecountsinfo(ax, 'class', data)
In [393]:
from eda import axes_utils
print(list(dir(axes_utils)))
['Add_data_labels', 'Add_valuecountsinfo', 'Change_barWidth', 'Highlight_Top_n_values', 'Set_axes_labels_titles', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'blended_transform_factory', 'create_string_for_plot', 'get_hist', 'gridspec', 'mdates', 'np', 'paddingString', 'pd', 'plt', 'sns', 'stylize_axes', 'time']

Approach for imbalanced datasets

Techniques to correctly distinguish the minority class can be categorized into four main groups, depending on how they deal with the problem. The four groups are :


  1. Algorithm level approaches (also called internal) try to adapt existing classifier learning algorithms to bias the learning toward the minority class. To perform the adaptation, special knowledge of both the corresponding classifier and the application domain is required, so as to comprehend why the classifier fails when the class distribution is uneven. More details about these types of methods are given in Chap. 6.

  2. Data level (or external) approaches aim at rebalancing the class distribution by resampling the data space. This way, modification of the learning algorithm is avoided, since the effect caused by imbalance is decreased with a preprocessing step. These methods are discussed in depth in Chap. 5.

  3. Cost-sensitive learning frameworks fall between data and algorithm level approaches. Both data level transformations (by adding costs to instances) and algorithm level modifications (by modifying the learning process to accept costs) [13, 48, 86] are incorporated. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost errors of both classes. An overview of cost-sensitive approaches for the class imbalance problem is presented in Chap. 4 (see the sketch after this list).

  4. Ensemble-based methods usually consist of a combination of an ensemble learning algorithm [59] and one of the techniques above, specifically data level and cost-sensitive ones [27]. When a data level approach is added to the ensemble learning algorithm, the new hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. Ensemble-based models are thoroughly described in Chap. 7.

— Learning from Imbalanced Data Sets, Alberto Fernández et al.
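
Of the four groups, sklearn makes the cost-sensitive route (group 3) the cheapest to try: many estimators accept a class_weight parameter that raises the misclassification cost of rare classes. A minimal sketch (a hypothetical weighted variant, not fitted here; the Linear SVC section below uses the same idea):

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' sets each class weight to
# n_samples / (n_classes * class_count), so errors on the rare
# 'good'/'vgood' classes are penalised more heavily during fitting
lr_weighted = LogisticRegression(max_iter=1000, class_weight='balanced')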

Applying Machine Learning Algorithms for classification

Training and testing the data

In [394]:
X_cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
Y_cols = ['class']

X = data[X_cols]
Y = data[Y_cols]
In [395]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)
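
Note that the split above is not stratified; given the roughly 70:22:4:4 class distribution, a stratified split keeps those proportions identical in both partitions. A sketch of the alternative (not applied here; the results below use the unstratified split):

# Stratified variant: preserve the class proportions in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, random_state=1, stratify=Y)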
In [396]:
X_train
Out[396]:
buying maint doors persons lug_boot safety
142 vhigh high 3 2 big med
1026 med high 4 2 small low
537 high vhigh 5more more big low
1298 low vhigh 2 2 small high
1296 low vhigh 2 2 small low
... ... ... ... ... ... ...
715 high med 4 4 med med
905 med vhigh 3 4 med high
1096 med med 2 4 big med
235 vhigh med 2 more small med
1061 med high 5more 2 big high

1296 rows × 6 columns

Preparing the data

In [397]:
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
Train (1296, 6) (1296, 1)
Test (432, 6) (432, 1)
In [398]:
#Verifying order of columns
#cat = ["buying","maint","doors","persons","lug_boot","safety"]
print(X_train.columns)
print(X_test.columns)
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')
Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')
In [399]:
from sklearn.preprocessing import OrdinalEncoder

#https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py

# prepare input data
def prepare_inputs(X_train, X_test): 
    oe = OrdinalEncoder(categories = [['low','med','high', 'vhigh' ],
                         ['low','med','high', 'vhigh' ],
                         ['2', '3', '4', '5more'],
                         ['2', '4', 'more'],
                         ['small', 'med', 'big'],
                         ['low', 'med', 'high']]
                       )
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train) 
    X_test_enc = oe.transform(X_test) 
    
    return X_train_enc, X_test_enc
In [400]:
from sklearn.preprocessing import LabelEncoder

# prepare input data
def prepare_targets(y_train, y_test): 
    le = LabelEncoder()
    le.fit(np.ravel(y_train))
    y_train_enc = le.transform(np.ravel(y_train)) 
    y_test_enc = le.transform(np.ravel(y_test)) 
    
    return y_train_enc, y_test_enc
In [401]:
from sklearn.preprocessing import OrdinalEncoder
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
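
LabelEncoder assigns integer codes in sorted label order, so the codes appearing in the classification reports below are 0 = acc, 1 = good, 2 = unacc, 3 = vgood. A quick sketch to inspect the mapping (fitting a throwaway encoder, since prepare_targets does not return the fitted one):

# The fitted classes_, listed in code order 0..3
le = LabelEncoder().fit(np.ravel(y_train))
print(le.classes_)   # ['acc' 'good' 'unacc' 'vgood'] -> codes 0, 1, 2, 3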

Feature Selection (Optional)

A comparison of the three scenarios below does not justify applying feature selection methods :

  • Accuracy when all 6 features are considered : 81.02%
  • Accuracy when the 4 features selected through "information gain" (mutual information) are considered : 81.25%
  • Accuracy when the 4 features selected through "Chi Squared" are considered : 81.25%

Note that we have applied a Logistic Regression classifier in all three scenarios to check for any significant movement in the accuracy scores.

In [402]:
# fit the model using all the features
model = LogisticRegression(solver='lbfgs', max_iter = 1000)
model.fit(X_train_enc, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_enc)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.02

Example of mutual information feature selection for categorical data

In [403]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
In [404]:
# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k=4) 
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
In [405]:
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
In [406]:
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
Feature 0: 0.053196
Feature 1: 0.022979
Feature 2: 0.000000
Feature 3: 0.146334
Feature 4: 0.005870
Feature 5: 0.179178
In [407]:
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_) 
plt.show()
In [408]:
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.25

Example of chi-squared feature selection for categorical data

In [409]:
from sklearn.feature_selection import chi2
# feature selection
def select_features2(X_train, y_train, X_test): 
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
In [410]:
# feature selection
X_train_fs2, X_test_fs2, fs2 = select_features2(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs2.scores_)):
    print('Feature %d: %f' % (i, fs2.scores_[i]))

# plot the scores
plt.bar([i for i in range(len(fs2.scores_))], fs2.scores_) 
plt.show()
Feature 0: 107.312113
Feature 1: 71.158014
Feature 2: 5.761583
Feature 3: 136.535729
Feature 4: 27.666582
Feature 5: 197.004190
In [411]:
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs2, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs2)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Accuracy: 81.25
The following classifiers will be trained and compared next :

  • Logistic Regression
  • k-Nearest Neighbors
  • Decision Trees
  • Naive Bayes
  • Random Forest
  • Gradient Boosting
  • Linear SVC

For each model, a learning curve is also plotted. A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

Useful link : https://scikit-learn.org/stable/modules/learning_curve.html

In [412]:
# Sample code for calculating time

st_time = time.time()

#code.......
#code.......

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
Total time: 0.00s
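
The same pattern can be wrapped in a context manager so the boilerplate is not repeated in every cell (a small convenience sketch):

from contextlib import contextmanager

@contextmanager
def timer():
    # Print the wall-clock time taken by the enclosed block
    st = time.time()
    yield
    print('Total time: {:.2f}s'.format(time.time() - st))

with timer():
    pass  # code to be timed goes here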
In [413]:
# Refer to section 8 for the evaluation metrics

def evaluation_parametrics(y_train,yp_train,y_test,yp_test):
  print("--------------------------------------------------------------------------")
  print("Classification Report for Train Data")
  print(classification_report(y_train, yp_train))
  print("Classification Report for Test Data")
  print(classification_report(y_test, yp_test))
  print("--------------------------------------------------------------------------")
  # Accuracy
  print("Accuracy on Train Data is: {}".format(round(accuracy_score(y_train,yp_train),2)))
  print("Accuracy on Test Data is: {}".format(round(accuracy_score(y_test,yp_test),2)))
  print("--------------------------------------------------------------------------")
  # Precision
  print("Precision on Train Data is: {}".format(round(precision_score(y_train,yp_train,average = "weighted"),2)))
  print("Precision on Test Data is: {}".format(round(precision_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
  # Recall 
  print("Recall on Train Data is: {}".format(round(recall_score(y_train,yp_train,average = "weighted"),2)))
  print("Recall on Test Data is: {}".format(round(recall_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
  # F1 Score
  print("F1 Score on Train Data is: {}".format(round(f1_score(y_train,yp_train,average = "weighted"),2)))
  print("F1 Score on Test Data is: {}".format(round(f1_score(y_test,yp_test,average = "weighted"),2)))
  print("--------------------------------------------------------------------------")
In [414]:
# Creating custom method to populate a dictionary object with evaluation scores of the classifiers 

def create_dict(model, modelname, y_train, yp_train, y_test, yp_test):
    dict1 = {modelname :  {"F1" : {"Train": float(np.round(f1_score(y_train,yp_train,average = "weighted"),2)),
                                  "Test": float(np.round(f1_score(y_test,yp_test,average = "weighted"),2))},
                            "Recall": {"Train": float(np.round(recall_score(y_train,yp_train,average = "weighted"),2)),
                                       "Test": float(np.round(recall_score(y_test,yp_test,average = "weighted"),2))},
                            "Precision" :{"Train": float(np.round(precision_score(y_train,yp_train,average = "weighted"),2)),
                                        "Test": float(np.round(precision_score(y_test,yp_test,average = "weighted"),2))
                                       }}
                          
            }
    return dict1

dict = {}   # note: this name shadows the built-in dict; kept for consistency with the cells below
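
Once populated, the nested {model: {metric: {split: score}}} dictionary can be flattened into a comparison table; one possible sketch:

# One column block per model, metrics as sub-columns, Train/Test as rows
scores_df = pd.concat({name: pd.DataFrame(metrics)
                       for name, metrics in dict.items()}, axis=1)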
In [415]:
def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

Logistic Regression Classifier

In [416]:
lr = LogisticRegression(max_iter = 1000,random_state = 48, multi_class = 'multinomial')

st_time = time.time()
lr.fit(X_train_enc,y_train_enc)

yp_train_enc = lr.predict(X_train_enc)
yp_test_enc = lr.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(lr,X_test_enc, y_test_enc)

dict1 = create_dict(lr, "Logistic Regression Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(lr, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Logistic Regression Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.06s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.68      0.63      0.66       296
           1       0.60      0.41      0.49        51
           2       0.89      0.93      0.91       900
           3       0.80      0.71      0.75        49

    accuracy                           0.83      1296
   macro avg       0.74      0.67      0.70      1296
weighted avg       0.83      0.83      0.83      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.60      0.53      0.57        88
           1       0.42      0.28      0.33        18
           2       0.88      0.93      0.90       310
           3       0.79      0.69      0.73        16

    accuracy                           0.81       432
   macro avg       0.67      0.61      0.63       432
weighted avg       0.80      0.81      0.80       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.83
Accuracy on Test Data is: 0.81
--------------------------------------------------------------------------
Precision on Train Data is: 0.83
Precision on Test Data is: 0.8
--------------------------------------------------------------------------
Recall on Train Data is: 0.83
Recall on Test Data is: 0.81
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.83
F1 Score on Test Data is: 0.8
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[416]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>
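
The FutureWarning above names the replacement API; an equivalent call without the deprecation warning (a sketch):

from sklearn.metrics import ConfusionMatrixDisplay

# Same confusion-matrix plot via the non-deprecated class method
ConfusionMatrixDisplay.from_estimator(lr, X_test_enc, y_test_enc)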

Decision Tree Classifier

In [417]:
dt = DecisionTreeClassifier(max_depth = 7,random_state = 48) # Keeping max_depth = 7 to avoid overfitting
dt.fit(X_train_enc,y_train_enc)

yp_train_enc = dt.predict(X_train_enc)
yp_test_enc = dt.predict(X_test_enc)

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(dt,X_test_enc, y_test_enc)

dict1 = create_dict(dt, "Decision Tree Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(dt, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Decision Tree Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.84      0.96      0.89       296
           1       0.83      0.59      0.69        51
           2       0.99      0.96      0.98       900
           3       0.79      0.78      0.78        49

    accuracy                           0.94      1296
   macro avg       0.86      0.82      0.84      1296
weighted avg       0.94      0.94      0.94      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.78      0.94      0.85        88
           1       0.83      0.56      0.67        18
           2       0.99      0.95      0.97       310
           3       0.88      0.88      0.88        16

    accuracy                           0.93       432
   macro avg       0.87      0.83      0.84       432
weighted avg       0.94      0.93      0.93       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.94
Accuracy on Test Data is: 0.93
--------------------------------------------------------------------------
Precision on Train Data is: 0.94
Precision on Test Data is: 0.94
--------------------------------------------------------------------------
Recall on Train Data is: 0.94
Recall on Test Data is: 0.93
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.94
F1 Score on Test Data is: 0.93
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[417]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>
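
The value max_depth = 7 was fixed by hand to limit overfitting; a hedged sketch of validating that choice with a small grid search over depths (5-fold CV on the training split, weighted F1 as the criterion):

from sklearn.model_selection import GridSearchCV

# Search a small range of depths instead of hand-picking one
grid = GridSearchCV(DecisionTreeClassifier(random_state=48),
                    param_grid={'max_depth': range(3, 13)},
                    scoring='f1_weighted', cv=5)
grid.fit(X_train_enc, y_train_enc)
print(grid.best_params_, round(grid.best_score_, 3))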

K Nearest Neighbors Classifier

In [418]:
# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 7)

st_time = time.time()

knn.fit(X_train_enc,y_train_enc)

yp_train_enc = knn.predict(X_train_enc)
yp_test_enc = knn.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(knn,X_test_enc, y_test_enc)

dict1 = create_dict(knn, "K Nearest Neighbor Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(knn, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Knn Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.08s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       296
           1       0.96      0.84      0.90        51
           2       0.99      0.99      0.99       900
           3       1.00      0.90      0.95        49

    accuracy                           0.98      1296
   macro avg       0.97      0.93      0.95      1296
weighted avg       0.98      0.98      0.98      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.86      0.94      0.90        88
           1       0.92      0.67      0.77        18
           2       0.98      0.98      0.98       310
           3       1.00      0.81      0.90        16

    accuracy                           0.95       432
   macro avg       0.94      0.85      0.89       432
weighted avg       0.95      0.95      0.95       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.98
Accuracy on Test Data is: 0.95
--------------------------------------------------------------------------
Precision on Train Data is: 0.98
Precision on Test Data is: 0.95
--------------------------------------------------------------------------
Recall on Train Data is: 0.98
Recall on Test Data is: 0.95
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.98
F1 Score on Test Data is: 0.95
--------------------------------------------------------------------------
/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
Out[418]:
<module 'matplotlib.pyplot' from '/Users/bhaskarroy/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py'>

Naive Bayes Classifier

In [419]:
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

st_time = time.time()

gnb.fit(X_train_enc,y_train_enc)

yp_train_enc = gnb.predict(X_train_enc)
yp_test_enc = gnb.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(gnb,X_test_enc, y_test_enc)

dict1 = create_dict(gnb, "Naive Bayes Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(gnb, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Naive Bayes Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.00s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.67      0.23      0.35       296
           1       0.55      0.24      0.33        51
           2       0.87      0.87      0.87       900
           3       0.18      1.00      0.30        49

    accuracy                           0.70      1296
   macro avg       0.57      0.58      0.46      1296
weighted avg       0.78      0.70      0.71      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.58      0.20      0.30        88
           1       0.50      0.17      0.25        18
           2       0.87      0.87      0.87       310
           3       0.19      1.00      0.32        16

    accuracy                           0.71       432
   macro avg       0.54      0.56      0.44       432
weighted avg       0.77      0.71      0.71       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.7
Accuracy on Test Data is: 0.71
--------------------------------------------------------------------------
Precision on Train Data is: 0.78
Precision on Test Data is: 0.77
--------------------------------------------------------------------------
Recall on Train Data is: 0.7
Recall on Test Data is: 0.71
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.71
F1 Score on Test Data is: 0.71
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Naive Bayes classifier]
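
As an aside: GaussianNB models every feature as a continuous Gaussian, while all six attributes here are ordinal-encoded categories, which helps explain the weak scores above. Below is a minimal editorial sketch (not part of the original run) using sklearn's CategoricalNB, which models per-category frequencies directly; it assumes the encoded splits hold non-negative integer codes.

In [ ]:
# Sketch: CategoricalNB treats each column as a discrete categorical feature,
# matching this dataset better than GaussianNB's continuous-Gaussian assumption.
from sklearn.naive_bayes import CategoricalNB

cnb = CategoricalNB()
cnb.fit(X_train_enc, y_train_enc)   # assumes non-negative integer-encoded features

print("Train F1 (weighted): {:.2f}".format(
    f1_score(y_train_enc, cnb.predict(X_train_enc), average='weighted')))
print("Test F1 (weighted): {:.2f}".format(
    f1_score(y_test_enc, cnb.predict(X_test_enc), average='weighted')))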

Random Forest Classifier

In [420]:
rf = RandomForestClassifier(max_depth = 7, random_state = 48) # keeping max_depth = 7, the same as the decision tree

st_time = time.time()

rf.fit(X_train_enc,y_train_enc)

yp_train_enc = rf.predict(X_train_enc)
yp_test_enc = rf.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(rf, X_test_enc, y_test_enc)

dict1 = create_dict(rf, "Random Forest Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(rf, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Random Forest Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.17s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       296
           1       0.93      0.84      0.89        51
           2       1.00      0.99      0.99       900
           3       0.93      0.76      0.83        49

    accuracy                           0.97      1296
   macro avg       0.94      0.89      0.92      1296
weighted avg       0.98      0.97      0.97      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.81      0.93      0.87        88
           1       0.86      0.67      0.75        18
           2       0.99      0.96      0.97       310
           3       0.93      0.88      0.90        16

    accuracy                           0.94       432
   macro avg       0.90      0.86      0.87       432
weighted avg       0.94      0.94      0.94       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.97
Accuracy on Test Data is: 0.94
--------------------------------------------------------------------------
Precision on Train Data is: 0.98
Precision on Test Data is: 0.94
--------------------------------------------------------------------------
Recall on Train Data is: 0.97
Recall on Test Data is: 0.94
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.97
F1 Score on Test Data is: 0.94
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Random Forest classifier]
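
Note that max_depth = 7 was carried over from the decision tree rather than tuned for the forest. A small grid search is one way to check that choice; the sketch below is an editorial addition, and the grid values are illustrative, not from the original run.

In [ ]:
# Sketch: tune depth and forest size with 5-fold CV,
# scoring by macro F1 so the rare classes count equally.
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],
              'max_depth': [7, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=48),
                      param_grid, scoring='f1_macro', cv=5)
search.fit(X_train_enc, y_train_enc)
print(search.best_params_, round(search.best_score_, 3))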

Linear SVC Classifier

In [421]:
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)

st_time = time.time()
svm.fit(X_train_enc,y_train_enc)

yp_train_enc = svm.predict(X_train_enc)
yp_test_enc = svm.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(svm, X_test_enc, y_test_enc)

dict1 = create_dict(svm, "Linear SVC", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(svm, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Linear SVC Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.01s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       0.78      0.66      0.71       296
           1       0.48      0.94      0.63        51
           2       0.91      0.89      0.90       900
           3       0.61      0.76      0.67        49

    accuracy                           0.84      1296
   macro avg       0.69      0.81      0.73      1296
weighted avg       0.85      0.84      0.84      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.74      0.64      0.68        88
           1       0.50      1.00      0.67        18
           2       0.90      0.88      0.89       310
           3       0.59      0.62      0.61        16

    accuracy                           0.83       432
   macro avg       0.68      0.79      0.71       432
weighted avg       0.84      0.83      0.83       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 0.84
Accuracy on Test Data is: 0.83
--------------------------------------------------------------------------
Precision on Train Data is: 0.85
Precision on Test Data is: 0.84
--------------------------------------------------------------------------
Recall on Train Data is: 0.84
Recall on Test Data is: 0.83
--------------------------------------------------------------------------
F1 Score on Train Data is: 0.84
F1 Score on Test Data is: 0.83
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Linear SVC classifier]
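
One caveat: a linear model sees the ordinal codes as points on a line, so it can only separate classes with thresholds along that imposed ordering. One-hot encoding usually suits linear models better. A hedged editorial sketch follows, assuming X_train and X_test (names assumed here) still hold the raw categorical columns from before encoding.

In [ ]:
# Sketch: one-hot encode the categories before the linear SVC.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

ohe_svm = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),       # one binary column per category
    LinearSVC(class_weight='balanced', C=0.1, max_iter=10000))
ohe_svm.fit(X_train, y_train_enc)                 # X_train: raw categorical frame (assumed name)
print('Test accuracy: {:.2f}'.format(ohe_svm.score(X_test, y_test_enc)))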

Gradient Boosting

In [422]:
gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=10)

st_time = time.time()
gb_model.fit(X_train_enc,y_train_enc)

yp_train_enc = gb_model.predict(X_train_enc)
yp_test_enc = gb_model.predict(X_test_enc)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
ConfusionMatrixDisplay.from_estimator(gb_model, X_test_enc, y_test_enc)  # was mistakenly plotting svm's confusion matrix

dict1 = create_dict(gb_model, "Gradient Boosting", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
dict.update(dict1)

plot_learning_curve(gb_model, X = X_train_enc, y = y_train_enc, 
                    title =  "Learning Curves (Gradient Boosting Classifier)", 
                    train_sizes=np.linspace(0.1, 1.0, 5))
Total time: 0.96s
--------------------------------------------------------------------------
Classification Report for Train Data
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       296
           1       1.00      1.00      1.00        51
           2       1.00      1.00      1.00       900
           3       1.00      1.00      1.00        49

    accuracy                           1.00      1296
   macro avg       1.00      1.00      1.00      1296
weighted avg       1.00      1.00      1.00      1296

Classification Report for Test Data
              precision    recall  f1-score   support

           0       0.93      0.98      0.96        88
           1       0.93      0.78      0.85        18
           2       0.98      0.98      0.98       310
           3       1.00      0.94      0.97        16

    accuracy                           0.97       432
   macro avg       0.96      0.92      0.94       432
weighted avg       0.97      0.97      0.97       432

--------------------------------------------------------------------------
Accuracy on Train Data is: 1.0
Accuracy on Test Data is: 0.97
--------------------------------------------------------------------------
Precision on Train Data is: 1.0
Precision on Test Data is: 0.97
--------------------------------------------------------------------------
Recall on Train Data is: 1.0
Recall on Test Data is: 0.97
--------------------------------------------------------------------------
F1 Score on Train Data is: 1.0
F1 Score on Test Data is: 0.97
--------------------------------------------------------------------------
[Figure: confusion matrix (test data) and learning curves for the Gradient Boosting classifier]
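
The perfect 1.00 train scores with max_depth = 10 suggest the booster has memorized the training set, even though it still generalizes well here. Shallower trees plus early stopping are the usual remedies; the sketch below is an editorial addition with illustrative parameter values, not from the original run.

In [ ]:
# Sketch: shallower trees and early stopping to curb overfitting.
gb_reg = GradientBoostingClassifier(
    n_estimators=500,
    max_depth=3,              # typical depth for boosted trees
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=10,      # stop once the validation score stalls
    random_state=48)
gb_reg.fit(X_train_enc, y_train_enc)
print('Boosting rounds used:', gb_reg.n_estimators_)
print('Test accuracy: {:.2f}'.format(gb_reg.score(X_test_enc, y_test_enc)))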

Listing the performance from all the models

In [424]:
# Retrieving the performance scores from the dict object
pd.DataFrame.from_dict({(i,j): dict[i][j] 
                           for i in dict.keys() 
                           for j in dict[i].keys()},
                       orient='index')
Out[424]:
                                            Train  Test
Logistic Regression Classifier  F1           0.83  0.80
                                Recall       0.83  0.81
                                Precision    0.83  0.80
Decision Tree Classifier        F1           0.94  0.93
                                Recall       0.94  0.93
...                             ...           ...   ...
Linear SVC                      Recall       0.84  0.83
                                Precision    0.85  0.84
Gradient Boosting               F1           1.00  0.97
                                Recall       1.00  0.97
                                Precision    1.00  0.97

21 rows × 2 columns

In [425]:
# Retrieving the performance scores from the dict object
# Transposing rows and headers relative to the previous cell
# Tabulating the scores for the different classifiers

model_names = []
frames = []

for model_name, d in dict.items():
    model_names.append(model_name)
    frames.append(pd.DataFrame.from_dict(d, orient='columns'))

df = pd.concat(frames, keys=model_names)
df.unstack(level = -1).style.background_gradient(cmap='Blues')
Out[425]:
                                   F1            Recall         Precision
                                Train  Test   Train  Test    Train  Test
Logistic Regression Classifier   0.83  0.80    0.83  0.81     0.83  0.80
Decision Tree Classifier         0.94  0.93    0.94  0.93     0.94  0.94
K Nearest Neighbor Classifier    0.98  0.95    0.98  0.95     0.98  0.95
Naive Bayes Classifier           0.71  0.71    0.70  0.71     0.78  0.77
Random Forest Classifier         0.97  0.94    0.97  0.94     0.98  0.94
Linear SVC                       0.84  0.83    0.84  0.83     0.85  0.84
Gradient Boosting                1.00  0.97    1.00  0.97     1.00  0.97
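
All of the figures above are weighted averages, which the 70% majority class dominates. Macro averages treat the four classes equally and so expose minority-class weakness more clearly; a short editorial sketch over the models fitted in this section:

In [ ]:
# Sketch: macro F1 penalizes models that only do well on the dominant class.
for name, model in [("Naive Bayes", gnb), ("Random Forest", rf),
                    ("Linear SVC", svm), ("Gradient Boosting", gb_model)]:
    macro = f1_score(y_test_enc, model.predict(X_test_enc), average='macro')
    print("{:20s} macro F1 (test): {:.2f}".format(name, macro))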

Conclusion

  • The Decision Tree, K Nearest Neighbor, Random Forest and Gradient Boosting classifiers all scored high on F1, Recall and Precision for both the train and test sets.
  • Gradient Boosting gave the best overall performance. However, this comes at the cost of interpretability and explainability.
  • Logistic Regression and Linear SVC, followed by Naive Bayes, scored lowest on the evaluation metrics.
  • Because the class distribution is skewed, the weighted averages above can hide weak minority-class performance; the macro averages in the per-class reports (e.g. macro F1 of 0.46 versus weighted F1 of 0.71 on train for Naive Bayes) give a more conservative picture, as the macro F1 sketch above illustrates.

Resources to follow for imbalanced learning

Datasets

  • Breast Cancer Wisconsin dataset
  • Credit Card Fraud detection Kaggle dataset

Real-life scenarios with data imbalance

The table below is adapted from "Learning from imbalanced data: open challenges and future directions" by Bartosz Krawczyk.

Application Area               Problem Description
Activity Recognition           Detection of rare or less-frequent activities (multi-class problem)
Behavior Analysis              Recognition of dangerous behavior (binary problem)
Cancer Malignancy Grading      Analyzing the cancer severity (binary and multi-class problem)
Hyperspectral Data Analysis    Classification of varying areas in multi-dimensional images (multi-class problem)
Industrial Systems Monitoring  Fault detection in industrial machinery (binary problem)
Sentiment Analysis             Emotion and temper recognition in text (binary and multi-class problem)
Software Defect Prediction     Recognition of errors in code blocks (binary problem)
Target Detection               Classification of specified targets appearing with varied frequency (multi-class problem)
Text Mining                    Detecting relations in literature (binary problem)
Video Mining                   Recognizing objects and actions in video sequences (binary and multi-class problem)