Credit Screening Dataset

This dataset has been downloaded from UC Irvine Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Credit+Approval

This dataset is regarding credit card applications.
The target variable/label is whether the application has been granted credit or not.
All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

The objective is here to build a model to give binary output based on the input attributes.

Summary of Key information

Number of Instances/training examples          : 690  
Number of Instances with missing attributes    :  37  
Number of qualified Instances/training examples : 653

Number of Input Attributes                     : 15
Number of categorical attributes               :  9
Number of numerical attributes                 :  6

Target Attribute Type                          : Binary Class
Target Class distribution                      : 54%:45%
Problem Identification                         : Binary Classification with balanced data set
In [2]:
import os
print(os.environ['PATH'])
/usr/local/lib/ruby/gems/3.1.0/bin:/usr/local/opt/ruby/bin:/usr/local/lib/ruby/gems/3.1.0/bin:/usr/local/opt/ruby/bin:/Users/bhaskarroy/opt/anaconda3/bin:/Users/bhaskarroy/opt/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/usr/local/share/dotnet:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/usr/local/mysql/bin
In [3]:
from notebook.services.config import ConfigManager
cm = ConfigManager().update('notebook', {'limit_output': 20})

Loading necessary libraries

In [4]:
import numpy as np
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt

from eda import eda_overview, axes_utils

import category_encoders as ce 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OrdinalEncoder

from sklearn.model_selection import train_test_split, learning_curve, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.metrics import recall_score, precision_score, accuracy_score,confusion_matrix, ConfusionMatrixDisplay, classification_report, f1_score
In [5]:
pd.set_option('display.max_rows', 20)
pd.set_option('precision', 4)

Importing the dataset

In [6]:
path = "/Users/bhaskarroy/BHASKAR FILES/BHASKAR CAREER/Data Science/Practise/"  \
       "Python/UCI Machine Learning Repository/Credit Screening/"
In [7]:
# Index
# credit.lisp
# credit.names
# crx.data
# crx.names


path1 = path + "crx.data" 
path_name = path + "credit.names"
path_crxname = path + "crx.names"
In [8]:
datContent = [i.strip().split() for i in open(path1).readlines()]
In [9]:
len(datContent)
Out[9]:
690
In [10]:
print(dir(type(datContent[0][0])))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
In [12]:
# Inspecting the contents
print(datContent[0][0].split(sep = ","))
['b', '30.83', '0', 'u', 'g', 'w', 'v', '1.25', 't', 't', '01', 'f', 'g', '00202', '0', '+']
In [13]:
len(datContent[0])
Out[13]:
1

Dataset Information

In [13]:
# Opening the file credit.names for the description of data set 
with open(path_name) as f:
    print(f.read())
1. Title: Japanese Credit Screening (examples & domain theory)

2. Source information:
   -- Creators: Chiharu Sano 
   -- Donor: Chiharu Sano
             csano@bonnie.ICS.UCI.EDU
   -- Date: 3/19/92

3. Past usage: 
   -- None Published

4. Relevant information:
   --  Examples represent positive and negative instances of people who were and were not 
       granted credit.
   --  The theory was generated by talking to the individuals at a Japanese company that grants
       credit.

5. Number of instances: 125




Attributes Information

In [14]:
# Opening the file crx.names for the description of data set 
with open(path_crxname) as f:
    print(f.read())
1. Title: Credit Approval

2. Sources: 
    (confidential)
    Submitted by quinlan@cs.su.oz.au

3.  Past Usage:

    See Quinlan,
    * "Simplifying decision trees", Int J Man-Machine Studies 27,
      Dec 1987, pp. 221-234.
    * "C4.5: Programs for Machine Learning", Morgan Kaufmann, Oct 1992
  
4.  Relevant Information:

    This file concerns credit card applications.  All attribute names
    and values have been changed to meaningless symbols to protect
    confidentiality of the data.
  
    This dataset is interesting because there is a good mix of
    attributes -- continuous, nominal with small numbers of
    values, and nominal with larger numbers of values.  There
    are also a few missing values.
  
5.  Number of Instances: 690

6.  Number of Attributes: 15 + class attribute

7.  Attribute Information:

    A1:	b, a.
    A2:	continuous.
    A3:	continuous.
    A4:	u, y, l, t.
    A5:	g, p, gg.
    A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
    A7:	v, h, bb, j, n, z, dd, ff, o.
    A8:	continuous.
    A9:	t, f.
    A10:	t, f.
    A11:	continuous.
    A12:	t, f.
    A13:	g, p, s.
    A14:	continuous.
    A15:	continuous.
    A16: +,-         (class attribute)

8.  Missing Attribute Values:
    37 cases (5%) have one or more missing values.  The missing
    values from particular attributes are:

    A1:  12
    A2:  12
    A4:   6
    A5:   6
    A6:   9
    A7:   9
    A14: 13

9.  Class Distribution
  
    +: 307 (44.5%)
    -: 383 (55.5%)


In [15]:
with open(path+"Index") as f:
    print(f.read())
Index of credit-screening

02 Dec 1996      182 Index
19 Sep 1992    32218 crx.data
19 Sep 1992     1486 crx.names
16 Jul 1992    12314 credit.lisp
16 Jul 1992      522 credit.names

In [1]:
#with open(path+"credit.lisp") as f:
#    print(f.read())

Data preprocessing

Following actions were undertaken:

  • Converting to Dataframe Format
  • As attribute names are anonymised, create standard feature name starting with 'A' and suffixed with feature number
  • Handling Missing values : 37 rows had missing values and are not being considered for model building
  • Converting Class symbols of Target variable to binary values
  • Processing Continous Attributes : Based on inspection, continuous attributes have been converted to float type.

Converting to Dataframe Format

In [17]:
# Inspecting the data
# We find that all the elements in a row is fused as one element.
# We need to use comma for splitting
datContent[0:5]
Out[17]:
[['b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+'],
 ['a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+'],
 ['a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+'],
 ['b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+'],
 ['b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+']]
In [18]:
# Splitting using comma to get individual elements
print(datContent[0][0].split(sep = ","))
['b', '30.83', '0', 'u', 'g', 'w', 'v', '1.25', 't', 't', '01', 'f', 'g', '00202', '0', '+']
In [19]:
# The Number of attributes/features is 16
attrCount = len(datContent[0][0].split(sep = ","))
attrCount
Out[19]:
16
In [20]:
# As all features names have been changed/anonymised, 
# we will create standard feature name starting with 'A' and suffixed with feature number
colNames = ["A"+str(i+1) for i in range(attrCount)]
print(colNames)
['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16']
In [21]:
# Extracting values/data that will be passed as data to create the Dataframe
rawData = []

for i in datContent:
    for j in i:
        rawData.append(j.split(sep = ","))      
In [22]:
# Creating the Dataframe
df = pd.DataFrame(rawData, columns = colNames)

# Inspecting the Dataframe
df.head()
Out[22]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0 u g w v 1.25 t t 01 f g 00202 0 +
1 a 58.67 4.46 u g q h 3.04 t t 06 f g 00043 560 +
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 +
3 b 27.83 1.54 u g w v 3.75 t t 05 t g 00100 3 +
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 00120 0 +
In [23]:
# Inspecting the dataframe 
# We find that features 'A2','A16' have symbols that would require further preprocessing 
df.describe()
Out[23]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
count 690 690 690 690 690 690 690 690 690 690 690 690 690 690 690 690
unique 3 350 215 4 4 15 10 132 2 2 23 2 3 171 240 2
top b ? 1.5 u g c v 0 t f 0 f g 00000 0 -
freq 468 12 21 519 519 137 399 70 361 395 395 374 625 132 295 383
In [24]:
# Checking the datatypes to decide the datatype conversions required feature wise
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A1      690 non-null    object
 1   A2      690 non-null    object
 2   A3      690 non-null    object
 3   A4      690 non-null    object
 4   A5      690 non-null    object
 5   A6      690 non-null    object
 6   A7      690 non-null    object
 7   A8      690 non-null    object
 8   A9      690 non-null    object
 9   A10     690 non-null    object
 10  A11     690 non-null    object
 11  A12     690 non-null    object
 12  A13     690 non-null    object
 13  A14     690 non-null    object
 14  A15     690 non-null    object
 15  A16     690 non-null    object
dtypes: object(16)
memory usage: 86.4+ KB

Handling Missing values

In [25]:
#df['A2'].astype("float")
df1 = df[(df == "?").any(axis = 1)]
In [26]:
df1
Out[26]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
71 b 34.83 4 u g d bb 12.5 t f 0 t g ? 0 -
83 a ? 3.5 u g d v 3 t f 0 t g 00300 0 -
86 b ? 0.375 u g d v 0.875 t f 0 t s 00928 0 -
92 b ? 5 y p aa v 8.5 t f 0 f g 00000 0 -
97 b ? 0.5 u g c bb 0.835 t f 0 t s 00320 0 -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
608 b ? 0.04 y p d v 4.25 f f 0 t g 00460 0 -
622 a 25.58 0 ? ? ? ? 0 f f 0 f p ? 0 +
626 b 22.00 7.835 y p i bb 0.165 f f 0 t g ? 0 -
641 ? 33.17 2.25 y p cc v 3.5 f f 0 t g 00200 141 -
673 ? 29.50 2 y p e h 2 f f 0 f g 00256 17 -

37 rows × 16 columns

In [27]:
# Selecting a subset without any missing values
df2 = df[(df != "?").all(axis = 1)]
df2.shape
Out[27]:
(653, 16)
In [28]:
df2.head()
Out[28]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0 u g w v 1.25 t t 01 f g 00202 0 +
1 a 58.67 4.46 u g q h 3.04 t t 06 f g 00043 560 +
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 +
3 b 27.83 1.54 u g w v 3.75 t t 05 t g 00100 3 +
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 00120 0 +

Converting Class symbols of Target variable to binary values

In [29]:
# Below code may return Setting with Copy warning
# Use df._is_view to check if a dataframe is a view or copy 
# df2.loc[:, 'A16'] = df2['A16'].map({"-": 0, "+":1}).values

# Use df.assign instead.
# https://stackoverflow.com/questions/36846060/how-to-replace-an-entire-column-on-pandas-dataframe
df2 = df2.assign(A16 = df2['A16'].map({"-": 0, "+":1}))
In [30]:
df2
Out[30]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0 u g w v 1.25 t t 01 f g 00202 0 1
1 a 58.67 4.46 u g q h 3.04 t t 06 f g 00043 560 1
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 1
3 b 27.83 1.54 u g w v 3.75 t t 05 t g 00100 3 1
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 00120 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685 b 21.08 10.085 y p e h 1.25 f f 0 f g 00260 0 0
686 a 22.67 0.75 u g c v 2 f t 02 t g 00200 394 0
687 a 25.25 13.5 y p ff ff 2 f t 01 t g 00200 1 0
688 b 17.92 0.205 u g aa v 0.04 f f 0 f g 00280 750 0
689 b 35.00 3.375 u g c h 8.29 f f 0 t g 00000 0 0

653 rows × 16 columns

In [31]:
df2
Out[31]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0 u g w v 1.25 t t 01 f g 00202 0 1
1 a 58.67 4.46 u g q h 3.04 t t 06 f g 00043 560 1
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 1
3 b 27.83 1.54 u g w v 3.75 t t 05 t g 00100 3 1
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 00120 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685 b 21.08 10.085 y p e h 1.25 f f 0 f g 00260 0 0
686 a 22.67 0.75 u g c v 2 f t 02 t g 00200 394 0
687 a 25.25 13.5 y p ff ff 2 f t 01 t g 00200 1 0
688 b 17.92 0.205 u g aa v 0.04 f f 0 f g 00280 750 0
689 b 35.00 3.375 u g c h 8.29 f f 0 t g 00000 0 0

653 rows × 16 columns

In [32]:
from eda import datasets
datasets.credit_screening()
Out[32]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.000 u g w v 1.25 t t 1.0 f g 202.0 0.0 1
1 a 58.67 4.460 u g q h 3.04 t t 6.0 f g 43.0 560.0 1
2 a 24.50 0.500 u g q h 1.50 t f 0.0 f g 280.0 824.0 1
3 b 27.83 1.540 u g w v 3.75 t t 5.0 t g 100.0 3.0 1
4 b 20.17 5.625 u g w v 1.71 t f 0.0 f s 120.0 0.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685 b 21.08 10.085 y p e h 1.25 f f 0.0 f g 260.0 0.0 0
686 a 22.67 0.750 u g c v 2.00 f t 2.0 t g 200.0 394.0 0
687 a 25.25 13.500 y p ff ff 2.00 f t 1.0 t g 200.0 1.0 0
688 b 17.92 0.205 u g aa v 0.04 f f 0.0 f g 280.0 750.0 0
689 b 35.00 3.375 u g c h 8.29 f f 0.0 t g 0.0 0.0 0

653 rows × 16 columns

Processing Continous Attributes

In [33]:
# Continous Variables are A2, A3, A11, A14, A15
contAttr = ['A2', 'A3','A8', 'A11', 'A14', 'A15']
In [34]:
for i in contAttr:
    df2.loc[:,i] = df2[i].astype("float")
In [35]:
df2
Out[35]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.000 u g w v 1.25 t t 1.0 f g 202.0 0.0 1
1 a 58.67 4.460 u g q h 3.04 t t 6.0 f g 43.0 560.0 1
2 a 24.50 0.500 u g q h 1.50 t f 0.0 f g 280.0 824.0 1
3 b 27.83 1.540 u g w v 3.75 t t 5.0 t g 100.0 3.0 1
4 b 20.17 5.625 u g w v 1.71 t f 0.0 f s 120.0 0.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685 b 21.08 10.085 y p e h 1.25 f f 0.0 f g 260.0 0.0 0
686 a 22.67 0.750 u g c v 2.00 f t 2.0 t g 200.0 394.0 0
687 a 25.25 13.500 y p ff ff 2.00 f t 1.0 t g 200.0 1.0 0
688 b 17.92 0.205 u g aa v 0.04 f f 0.0 f g 280.0 750.0 0
689 b 35.00 3.375 u g c h 8.29 f f 0.0 t g 0.0 0.0 0

653 rows × 16 columns

Univariate Analysis - Continous Variables

Findings from the distribution of numeric variables at overall level and considering the application status are as below:

  • The dispersion/standard deviation of numeric variables for the applications granted credit extends over a wide range.
  • The shape of distribution is similar in both groups for the variables 'A2', 'A3' and 'A14'.
  • In particular, Numeric variables 'A11' and 'A15' is concentrated to a very narrow range for the applications not granted credit.
In [36]:
eda_overview.UVA_numeric(data = df2, var_group = contAttr)
In [37]:
# Apply the default theme
sns.set_theme()
t = eda_overview.UVA_numeric_classwise(df2, 'A16', ['A16'], 
                                       colcount = 3, colwidth = 3,
                                       rowheight = 3,
                                       plot_type = 'histogram', element = 'step')

plt.gcf().savefig(path+'Numeric_interaction_class.png', dpi = 150)
In [38]:
t = eda_overview.distribution_comparison(df2, 'A16',['A16'])[0]
t
Out[38]:
Value Maximum Minimum Range Standard Deviation Unique Value count
A16 category 0 1 0 1 0 1 0 1 0 1
Continous Attributes
A2 74.830 76.75 15.17 13.75 59.660 63.0 10.7192 12.6894 222 219
A3 26.335 28.00 0.00 0.00 26.335 28.0 4.3931 5.4927 146 146
A8 13.875 28.50 0.00 0.00 13.875 28.5 2.0293 4.1674 67 117
A11 20.000 67.00 0.00 0.00 20.000 67.0 1.9584 6.3981 12 23
A14 2000.000 840.00 0.00 0.00 2000.000 840.0 172.0580 162.5435 100 108
A15 5552.000 100000.00 0.00 0.00 5552.000 100000.0 632.7817 7660.9492 110 145
In [39]:
t.to_csv(path +'NumericDistributionComparison.csv')
In [40]:
# Inspecting number of unique values
df2[contAttr].nunique()
Out[40]:
A2     340
A3     213
A8     131
A11     23
A14    164
A15    229
dtype: int64

Bivariate Analysis - Continous Variables

Findings from the correlation plot are as below :

  • No significant correlation between any pair of the features
  • No significant correlation between any pair of feature and target
In [41]:
# Continous Variables are A2, A3, A11, A14, A15
contAttr = ['A2', 'A3','A8', 'A11', 'A14', 'A15']

# Target Variable is A16
targetAttr = ['A16']
In [42]:
df2[contAttr+targetAttr]
Out[42]:
A2 A3 A8 A11 A14 A15 A16
0 30.83 0.000 1.25 1.0 202.0 0.0 1
1 58.67 4.460 3.04 6.0 43.0 560.0 1
2 24.50 0.500 1.50 0.0 280.0 824.0 1
3 27.83 1.540 3.75 5.0 100.0 3.0 1
4 20.17 5.625 1.71 0.0 120.0 0.0 1
... ... ... ... ... ... ... ...
685 21.08 10.085 1.25 0.0 260.0 0.0 0
686 22.67 0.750 2.00 2.0 200.0 394.0 0
687 25.25 13.500 2.00 1.0 200.0 1.0 0
688 17.92 0.205 0.04 0.0 280.0 750.0 0
689 35.00 3.375 8.29 0.0 0.0 0.0 0

653 rows × 7 columns

In [43]:
# Bivariate analysis at overall level

plt.rcdefaults()
#sns.set('notebook')
#sns.set_theme(style = 'whitegrid')
sns.set_context(font_scale = 0.6)
from pandas.plotting import scatter_matrix
scatter_matrix(df2[contAttr+targetAttr], figsize = (12,8));
In [44]:
# Bivariate analysis taking into account the target categories

#sns.set('notebook')
sns.set_theme(style="darkgrid")
sns.pairplot(df2[contAttr+targetAttr],hue= 'A16',height = 1.5)
Out[44]:
<seaborn.axisgrid.PairGrid at 0x7fc08ae9e790>
In [45]:
df2[contAttr+targetAttr].dtypes
Out[45]:
A2     float64
A3     float64
A8     float64
A11    float64
A14    float64
A15    float64
A16      int64
dtype: object
In [46]:
# Correlation table
df2[contAttr].corr()
Out[46]:
A2 A3 A8 A11 A14 A15
A2 1.0000 0.2177 0.4176 0.1982 -0.0846 0.0291
A3 0.2177 1.0000 0.3006 0.2698 -0.2171 0.1198
A8 0.4176 0.3006 1.0000 0.3273 -0.0648 0.0522
A11 0.1982 0.2698 0.3273 1.0000 -0.1161 0.0584
A14 -0.0846 -0.2171 -0.0648 -0.1161 1.0000 0.0734
A15 0.0291 0.1198 0.0522 0.0584 0.0734 1.0000
In [47]:
# Heatmap for correlation of numeric attributes
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(df2[contAttr].corr(), annot = True, ax = ax, annot_kws={"fontsize":10});
In [48]:
# Correlation matrix for customers not granted credit
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(df2[df2['A16'] == 0][contAttr].corr(), ax = ax, annot_kws={"fontsize":10}, annot = True);
In [49]:
# Correlation matrix for customers granted credit
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(df2[df2['A16'] == 1][contAttr].corr(),ax = ax,
            annot_kws={"fontsize":10}, annot = True);

Univariate Analysis - Categorical Variables

In [50]:
# Continous Variables are A2, A3, A8, A11, A14, A15
# Categorical Input Variables are A1, A4, A5, A6, A7, A9, A10, A12, A13
# Target Variable is A16 and is categorical.

catAttr = ["A1","A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"]
In [53]:
eda_overview.UVA_category(df2, var_group = catAttr + targetAttr,
                          colwidth = 3,
                          rowheight = 2,
                          colcount = 2,
                          spine_linewidth = 0.2,
                          nspaces = 4, ncountspaces = 3,
                          axlabel_fntsize = 7,
                          ax_xticklabel_fntsize = 7,
                          ax_yticklabel_fntsize = 7,
                          change_ratio = 0.6,
                          infofntsize = 7)

Distribution of the Target Class

Dataset is balanced as the ratio of the binary classes is ~55:45.
We can use Accuracy as a Evaluation metric for the classifier model.

In [54]:
plt.figure(figsize = (4,3), dpi = 100)
ax = sns.countplot(x = 'A16', data = df2, )
ax.set_ylim(0, 1.1*ax.get_ylim()[1])

axes_utils.Add_data_labels(ax.patches)
axes_utils.Change_barWidth(ax.patches, 0.8)
axes_utils.Add_valuecountsinfo(ax, 'A16',df2)

Train Test Split the dataset

In [55]:
X, y = df2.drop(targetAttr, axis = 1), df2['A16']
In [56]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)
In [57]:
print("X_train shape : {}".format(X_train.shape))
print("X_test shape : {}".format(X_test.shape))
print("y_train shape : {}".format(y_train.shape))
print("y_test shape : {}".format(y_test.shape))
X_train shape : (489, 15)
X_test shape : (164, 15)
y_train shape : (489,)
y_test shape : (164,)
In [58]:
X_train.head()
Out[58]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15
7 a 22.92 11.585 u g cc v 0.040 t f 0.0 f g 80.0 1349.0
64 b 26.67 4.250 u g cc v 4.290 t t 1.0 t g 120.0 0.0
320 b 21.25 1.500 u g w v 1.500 f f 0.0 f g 150.0 8.0
358 b 32.42 3.000 u g d v 0.165 f f 0.0 t g 120.0 0.0
628 b 29.25 13.000 u g d h 0.500 f f 0.0 f g 228.0 0.0

Transformation Pipelines

In [59]:
# Creating numeric Pipeline for standard scaling of numeric features 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('std_scaler', StandardScaler())])
In [60]:
from sklearn import set_config

set_config(display="diagram")
num_pipeline
Out[60]:
Pipeline(steps=[('std_scaler', StandardScaler())])
Please rerun this cell to show the HTML repr or trust the notebook.
In [61]:
df2_num_tr = num_pipeline.fit_transform(df2[contAttr])
pd.DataFrame(df2_num_tr)
Out[61]:
0 1 2 3 4 5
0 -0.0570 -0.9614 -0.2952 -0.3026 0.1287 -0.1931
1 2.2965 -0.0736 0.2362 0.7045 -0.8168 -0.0864
2 -0.5921 -0.8619 -0.2210 -0.5040 0.5925 -0.0362
3 -0.3106 -0.6549 0.4470 0.5031 -0.4779 -0.1926
4 -0.9581 0.1584 -0.1586 -0.5040 -0.3589 -0.1931
... ... ... ... ... ... ...
648 -0.8812 1.0462 -0.2952 -0.5040 0.4736 -0.1931
649 -0.7468 -0.8121 -0.0725 -0.1012 0.1168 -0.1181
650 -0.5287 1.7261 -0.0725 -0.3026 0.1168 -0.1929
651 -1.1483 -0.9206 -0.6544 -0.5040 0.5925 -0.0502
652 0.2956 -0.2896 1.7948 -0.5040 -1.0725 -0.1931

653 rows × 6 columns

In [62]:
# Transforming the X_train and X_test
In [63]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

# Segregating the numeric and categorical features
num_attribs = ['A2', 'A3','A8', 'A11', 'A14', 'A15']
cat_attribs = ["A1","A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"] 

# Creating Column Transformer for selectively applying tranformations
# both standard scaling and one hot encoding
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs), 
    ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)])

# Creating Column Transformer for selectively applying tranformations
# only one hot encoding and no standard scaling
categorical_pipeline = ColumnTransformer([
    ("num_selector", "passthrough", num_attribs), 
    ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)])
In [64]:
# Displaying the full_pipeline
from sklearn import set_config

set_config(display="diagram")
full_pipeline
Out[64]:
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('std_scaler',
                                                  StandardScaler())]),
                                 ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10',
                                  'A12', 'A13'])])
Please rerun this cell to show the HTML repr or trust the notebook.
In [65]:
# Displaying the categorical_pipeline
from sklearn import set_config

set_config(display="diagram")
categorical_pipeline
Out[65]:
ColumnTransformer(transformers=[('num_selector', 'passthrough',
                                 ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10',
                                  'A12', 'A13'])])
Please rerun this cell to show the HTML repr or trust the notebook.
In [66]:
# Learning the parameters for transforming from train set using full_pipeline 
# Transforming both train and test set
X_train_tr1 = full_pipeline.fit_transform(X_train)
X_test_tr1 = full_pipeline.transform(X_test)
In [67]:
# Learning the parameters for transforming from train set using categorical pipeline
# Transforming both train and test set
X_train_tr2 = categorical_pipeline.fit_transform(X_train)
X_test_tr2 = categorical_pipeline.transform(X_test)
In [68]:
# Transforming the target variable
In [69]:
from sklearn.preprocessing import LabelEncoder

# prepare input data
def prepare_targets(y_train, y_test): 
    le = LabelEncoder()
    le.fit(np.ravel(y_train))
    y_train_enc = le.transform(np.ravel(y_train)) 
    y_test_enc = le.transform(np.ravel(y_test)) 
    
    return y_train_enc, y_test_enc
In [70]:
y_train_tr, y_test_tr = prepare_targets(y_train, y_test)

Training and Testing Models

In [71]:
# Function for returning a string containing
# Classification report and the accuracy, precision, recall and F1 measures on train and test data
# The average parameter for the measures is 'macro' as the minority class is of importance.

from sklearn.metrics import recall_score, precision_score, accuracy_score, \
    confusion_matrix, ConfusionMatrixDisplay, classification_report, f1_score, \
    roc_curve, auc

def evaluation_parametrics(y_train,yp_train,y_test,yp_test,average_param = 'weighted'):
    '''
    average_param : values can be 'weighted', 'micro', 'macro'.
    Check link:
    https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
    '''
    d = 2
    txt = "-"*60 \
    + "\nClassification Report for Train Data\n" \
    + classification_report(y_train, yp_train) \
    + "\nClassification Report for Test Data\n" \
    + classification_report(y_test, yp_test) \
    + "\n" + "-"*60 + "\n" \
    + "Accuracy on Train Data is: {}".format(round(accuracy_score(y_train,yp_train),d)) \
    + '\n' \
    + "Accuracy on Test Data is: {}".format(round(accuracy_score(y_test,yp_test),d)) \
    + "\n" + "-"*60 + "\n" \
    + "Precision on Train Data is: {}".format(round(precision_score(y_train,yp_train,average = average_param),d)) \
    + "\n" \
    + "Precision on Test Data is: {}".format(round(precision_score(y_test,yp_test,average = average_param),d)) \
    + "\n" + "-"*60 + "\n" \
    + "Recall on Train Data is: {}".format(round(recall_score(y_train,yp_train,average = average_param),d)) \
    + "\n" \
    + 'Recall on Test Data is: {}'.format(round(recall_score(y_test,yp_test,average = average_param),d)) \
    + "\n" + "-"*60 + "\n" \
    + "F1 Score on Train Data is: {}".format(round(f1_score(y_train,yp_train,average = average_param),d)) \
    + "\n" \
    + "F1 Score on Test Data is: {}".format(round(f1_score(y_test,yp_test,average = average_param),d)) \
    + "\n" + "-"*60 + "\n" 
    return txt
In [72]:
def Confusion_matrix_ROC_AUC(name, alias, pipeline,
                             X_train_tr, y_train_tr,
                             X_test_tr,y_test_tr):
    '''
    This function reurns three plots :
        - Confusion matrix on testset predictions
        - Classification report for performance on train and test set
        - roc and auc curve for test set predictions

    The arguments are :
        name : short/terse name for the composite estimator
        alias : descriptive name for the composite estimator
        pipeline : Composite estimator
        X_train_tr, y_train_tr : train set feature matrix, train set target
        X_test_tr,y_test_tr : test set feature matrix, test set target

    For reference, below is a list containing the tuple of (name, alias, pipeline)
          [('SGD', 'Stochastic Gradient Classifier',SGDClassifier(random_state=42)),
          ('LR','Logistic Regression Classifier', LogisticRegression(max_iter = 1000,random_state = 48)),
          ('RF','Random Forest Classifier', RandomForestClassifier(max_depth=2, random_state=42)),
          ('KNN','KNN Classifier',KNeighborsClassifier(n_neighbors = 7)),
          ('NB','Naive Bayes Classifier', GaussianNB()),
          ('SVC','Support Vector Classifier',
          `LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)),
          ('CART', 'CART', DecisionTreeClassifier(max_depth = 7,random_state = 48)),
          ('GBM','Gradient Boosting Classifier',
           GradientBoostingClassifier(n_estimators=50, max_depth=10)),
          ('LDA', 'LDA Classifier', LinearDiscriminantAnalysis())]

    For instance, if the classifier is an SGDClassifier, the suggested name and alias are :
    'SGD'/'sgdclf', 'Stochastic Gradient Classifier'.

    It is recommended to adhere to name and alias conventions,
        - as the name argument is used to for checking whether calibrated Classifier CV is required or not.
        - as the alias argument will be used as title in the Classification report plot.

    Call the functions : evaluation_parametrics
    Check the links :
        https://peps.python.org/pep-0008/#maximum-line-length
        https://scikit-learn.org/stable/glossary.html#term-predict_proba

    Example :
        >> Confusion_matrix_ROC_AUC('sgd_clf','Stochastic Gradient Classifier',sgd_clf,
                                    X_train_tr, y_train_tr, X_test_tr,y_test_tr)

    '''
    from sklearn.metrics import recall_score, precision_score, accuracy_score, \
    confusion_matrix, ConfusionMatrixDisplay, classification_report, f1_score, \
    roc_curve, auc

    from sklearn.calibration import CalibratedClassifierCV

    fig = plt.figure(figsize=(10,5), dpi = 130)
    gridsize = (2, 3)
    ax1 = plt.subplot2grid(gridsize, (0, 0), colspan=1, rowspan=1)
    ax2 = plt.subplot2grid(gridsize, (0, 1), colspan = 2, rowspan = 2)
    ax3 = plt.subplot2grid(gridsize, (1, 0), colspan = 1)

    sns.set(font_scale=0.75) # Adjust to fit
    #---------------------------------------------------------------------------------
    # Displaying the confusion Matrix

    #ax1 = fig.add_subplot(1,3,2)

    # Fitting the model
    model = pipeline
    model.fit(X_train_tr, y_train_tr)

    # Predictions on train and test set
    yp_train_tr = model.predict(X_train_tr)
    yp_test_tr = model.predict(X_test_tr)

    # Creating the confusion matrix for test set results
    cm = confusion_matrix(y_test_tr, yp_test_tr, labels= pipeline.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels= pipeline.classes_)

    ax1.grid(False)
    disp.plot(ax = ax1)
    ax1.set_title('Confusion Matrix on testset pred')


    #---------------------------------------------------------------------------------
    # Displaying the evaluation results that include the classification report
    #ax2 = fig.add_subplot(1,3,2)

    eval_results = (str(alias) \
                    +'\n' \
                    + evaluation_parametrics(y_train_tr,yp_train_tr,
                                             y_test_tr,yp_test_tr))

    ax2.annotate(xy = (0,1), text = eval_results, size = 8,
                 ha = 'left', va = 'top', font = 'Andale Mono')
    ax2.patch.set(visible = False)
    ax2.tick_params(top=False, bottom=False, left=False, right=False,
                    labelleft=False, labelbottom=False)
    #ax2.ticks.off

    #---------------------------------------------------------------------------------
    # Displaying the ROC AUC curve
    import re
    pattern = re.compile('(sgd|SGD|SVC)')
    if re.search(pattern, name) :
        print('Calibrated Classifier CV needed.')
        #base_model = SGDClassifier()
        model = CalibratedClassifierCV(pipeline)
    else :
        print('Calibrated Classifier CV not needed.')
        model = pipeline

    # Fitting the model
    model.fit(X_train_tr, y_train_tr)

    #https://scikit-learn.org/stable/glossary.html#term-predict_proba
    preds = model.predict_proba(X_test_tr)
    pred = pd.Series(preds[:,1])

    fpr, tpr, thresholds = roc_curve(y_test_tr, pred)
    auc_score = auc(fpr, tpr)
    label='%s: auc=%f' % (name, auc_score)

    ax3.plot(fpr, tpr, linewidth=1)
    ax3.fill_between(fpr, tpr,  label = label, linewidth=1, alpha = 0.1, ec = 'black')
    ax3.plot([0, 1], [0, 1], 'k--') #x=y line.
    ax3.set_xlim([0.0, 1.0])
    ax3.set_ylim([0.0, 1.05])
    ax3.set_xlabel('False Positive Rate')
    ax3.set_ylabel('True Positive Rate')
    ax3.set_title('ROC curve')
    ax3.legend(loc = 'lower right')

    fig.tight_layout()
    plt.show()
    return fig
In [2]:
# Creating a dictionary to store the performace measures on train and test data
# Note the precion, recall and F1 score measures are weighted averages taking into consideration the class sizes

def create_dict(model, modelname, y_train, yp_train, y_test, yp_test, average_param = 'weighted'):
    '''
    average_param : values can be 'weighted', 'micro', 'macro'.
    Check link:
            https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
            https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
            https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
            #sklearn.metrics.precision_score
    
    '''
    d = 4
    dict1 = {modelname :  {"Accuracy":{"Train": float(np.round(accuracy_score(y_train,yp_train), d)),
                                       "Test": float(np.round(accuracy_score(y_test,yp_test),d))},
                           "F1" : {"Train": 
                                   float(np.round(f1_score(y_train,yp_train,average = average_param),d)),
                                  "Test": 
                                   float(np.round(f1_score(y_test,yp_test,average = average_param),d))},
                           "Recall": {"Train": 
                                      float(np.round(recall_score(y_train,yp_train,average = average_param),d)),
                                      "Test": 
                                      float(np.round(recall_score(y_test,yp_test,average = average_param),d))},
                           "Precision" :{"Train": 
                                         float(np.round(precision_score(y_train,yp_train,average = average_param),d)),
                                         "Test": 
                                         float(np.round(precision_score(y_test,yp_test,average = average_param),d))
                                       }}
            }
    return dict1

dict_perf = {}
In [74]:
# Display the performance measure outputs for all the classifiers
# unpacking the dictionary to dataframe

def display_results(dict_perf):
    pd.set_option('precision', 4)
    user_ids = []
    frames = []
    for user_id, d in dict_perf.items():
        user_ids.append(user_id)
        frames.append(pd.DataFrame.from_dict(d, orient='columns'))

    df = pd.concat(frames, keys=user_ids)
    df = df.unstack(level = -1)
    return df

Stochastic Gradient Descent Classifier

In [75]:
from sklearn.linear_model import SGDClassifier 
from sklearn.calibration import CalibratedClassifierCV

name = 'sgdclf'
sgd_clf = SGDClassifier(random_state=42)

st_time = time.time()
sgd_clf.fit(X_train_tr1, y_train_tr)

yp_train_tr = sgd_clf.predict(X_train_tr1)
yp_test_tr = sgd_clf.predict(X_test_tr1)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr,average_param = 'weighted'))

dict1 = create_dict(sgd_clf, "SGD Classifier", 
                    y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

sns.set(font_scale=0.75)
fig = Confusion_matrix_ROC_AUC('sgd_clf','Stochastic Gradient Classifier',
                         sgd_clf, X_train_tr1, y_train_tr,X_test_tr1,y_test_tr)

fig.savefig(path+'StochasticGradientClassifier.png', dpi = 150)
Total time: 0.01s
Calibrated Classifier CV needed.

Logistic Regression Classifier

In [76]:
lr = LogisticRegression(max_iter = 1000,random_state = 48)

st_time = time.time()
lr.fit(X_train_tr1, y_train_tr)

yp_train_tr = lr.predict(X_train_tr1)
yp_test_tr = lr.predict(X_test_tr1)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(lr, "Logistic Regression Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('lr','Logistic Regression Classifier',lr, X_train_tr1, y_train_tr,X_test_tr1,y_test_tr)
Total time: 0.02s
Calibrated Classifier CV not needed.

Random Forest Classifier

In [77]:
rf_clf = RandomForestClassifier(max_depth=5, random_state=42)

st_time = time.time()
rf_clf.fit(X_train_tr2, y_train_tr)

yp_train_tr = rf_clf.predict(X_train_tr2)
yp_test_tr = rf_clf.predict(X_test_tr2)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(rf_clf, "Random Forest Classifier", 
                    y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('rf_clf','Random Forest Classifier',rf_clf,
                         X_train_tr2, y_train_tr,X_test_tr2,y_test_tr)
Total time: 0.15s
Calibrated Classifier CV not needed.

KNN Classifier

In [78]:
# training a KNN classifier
knn_clf = KNeighborsClassifier(n_neighbors = 7)

st_time = time.time()
knn_clf.fit(X_train_tr1, y_train_tr)

yp_train_tr = knn_clf.predict(X_train_tr1)
yp_test_tr = knn_clf.predict(X_test_tr1)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(knn_clf, "k-Nearest Neighbor Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('knn_clf', "k-Nearest Neighbor Classifier",knn_clf,
                         X_train_tr1, y_train_tr,X_test_tr1,y_test_tr)
Total time: 0.03s
Calibrated Classifier CV not needed.

Naive Bayes Classifier

In [79]:
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()

st_time = time.time()
gnb_clf.fit(X_train_tr2, y_train_tr)

yp_train_tr = gnb_clf.predict(X_train_tr2)
yp_test_tr = gnb_clf.predict(X_test_tr2)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(gnb_clf, "Gaussian Naive Bayes Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('gnb_clf', "Gaussian Naive Bayes Classifier",gnb_clf,
                         X_train_tr2, y_train_tr,X_test_tr2,y_test_tr)
Total time: 0.00s
Calibrated Classifier CV not needed.

Linear Support Vector Classifier

In [80]:
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)

st_time = time.time()
svm.fit(X_train_tr1,y_train_tr)

yp_train_tr = svm.predict(X_train_tr1)
yp_test_tr = svm.predict(X_test_tr1)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(svm, "Support Vector Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('LinearSVC', "Support Vector Classifier",svm,
                         X_train_tr1, y_train_tr,X_test_tr1,y_test_tr)
Total time: 0.01s
Calibrated Classifier CV needed.

Decision Tree Classifier

In [81]:
dt = DecisionTreeClassifier(max_depth = 5,random_state = 48) 
# Keeping max_depth = 7 to avoid overfitting
dt.fit(X_train_tr2,y_train_tr)

yp_train_tr = dt.predict(X_train_tr2)
yp_test_tr = dt.predict(X_test_tr2)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(dt, "Decision Tree Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('dt', "Decision Tree Classifier", dt,
                         X_train_tr2, y_train_tr,X_test_tr2,y_test_tr)
Total time: 0.53s
Calibrated Classifier CV not needed.

Gradient Boosting Model

In [82]:
gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=5)

st_time = time.time()
gb_model.fit(X_train_tr2,y_train_tr)

yp_train_tr = gb_model.predict(X_train_tr2)
yp_test_tr = gb_model.predict(X_test_tr2)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))

#print(evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr))

dict1 = create_dict(gb_model, "Gradient Boosting Classifier", y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)


fig = Confusion_matrix_ROC_AUC('gb_model', "Gradient Boosting Classifier", gb_model,
                         X_train_tr2, y_train_tr,X_test_tr2,y_test_tr)
Total time: 0.11s
Calibrated Classifier CV not needed.

Linear Discriminant Analysis Model

In [83]:
lda_model = LinearDiscriminantAnalysis()

st_time = time.time()
lda_model.fit(X_train_tr1,y_train_tr)

yp_train_tr = lda_model.predict(X_train_tr1)
yp_test_tr = lda_model.predict(X_test_tr1)

en_time = time.time()
#print('Total time: {:.2f}s'.format(en_time-st_time))

evaluation_parametrics(y_train_tr,yp_train_tr,y_test_tr,yp_test_tr)

dict1 = create_dict(lda_model, "Linear Discriminant Analysis Classifier",
                    y_train_tr, yp_train_tr, y_test_tr, yp_test_tr)
dict_perf.update(dict1)

fig = Confusion_matrix_ROC_AUC('lda_model', "Linear Discriminant Analysis Classifier", lda_model,
                         X_train_tr1, y_train_tr,X_test_tr1,y_test_tr)
Calibrated Classifier CV not needed.
In [84]:
from matplotlib import pyplot as plt
import sklearn
from sklearn.metrics import roc_curve, auc
#from sklearn.cross_validation import train_test_split 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.naive_bayes import GaussianNB

# name -> (line format, classifier) 
CLASS_MAP = {
    'LogisticRegression':('-', LogisticRegression()),
    'Naive Bayes': ('--', GaussianNB()),
    'Decision Tree':('.-', DecisionTreeClassifier(max_depth=5)), 
    'Random Forest':(':', RandomForestClassifier( max_depth=5, n_estimators=10, max_features=1)),
}

# Divide cols by independent/dependent, rows by test/ train
#X, Y = df[df.columns[:3]], (df['species']=='virginica') X_train, X_test, Y_train, Y_test = \
#train_test_split(X, Y, test_size=.8)

#X_train_tr, y_train_tr,X_test_tr,y_test_tr

for name, (line_fmt, model) in CLASS_MAP.items():
    model.fit(X_train_tr1, y_train_tr)
    # array w one col per label
    preds = model.predict_proba(X_test_tr1)
    pred = pd.Series(preds[:,1])
    fpr, tpr, thresholds = roc_curve(y_test_tr, pred) 
    auc_score = auc(fpr, tpr)
    label='%s: auc=%f' % (name, auc_score) 
    plt.plot(fpr, tpr, line_fmt,
             linewidth=1, label=label)
    plt.legend(loc="lower right") 
    plt.title('Comparing Classifiers')

Listing the performance from all the models

In [85]:
display_results(dict_perf).style.background_gradient(cmap='Blues')
Out[85]:
Accuracy F1 Recall Precision
Train Test Train Test Train Test Train Test
SGD Classifier 0.8528 0.8049 0.8531 0.8053 0.8528 0.8049 0.8568 0.8117
Logistic Regression Classifier 0.8855 0.8902 0.8857 0.8905 0.8855 0.8902 0.8878 0.8974
Random Forest Classifier 0.9141 0.8902 0.9142 0.8904 0.9141 0.8902 0.9144 0.8917
k-Nearest Neighbor Classifier 0.8712 0.8720 0.8707 0.8717 0.8712 0.8720 0.8719 0.8720
Gaussian Naive Bayes Classifier 0.8487 0.8354 0.8475 0.8335 0.8487 0.8354 0.8512 0.8395
Support Vector Classifier 0.8650 0.8841 0.8652 0.8842 0.8650 0.8841 0.8738 0.8992
Decision Tree Classifier 0.9182 0.8354 0.9181 0.8347 0.9182 0.8354 0.9184 0.8356
Gradient Boosting Classifier 0.9939 0.8780 0.9939 0.8783 0.9939 0.8780 0.9939 0.8795
Linear Discriminant Analysis Classifier 0.8691 0.8780 0.8693 0.8780 0.8691 0.8780 0.8789 0.8949

Model Validation using K-Fold CrossValidation

In [86]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'accuracy'
In [87]:
# Creating ColumnTransformer for preprocessing the training data folds and testing data fold within the k-fold 
# Note that training data folds will be fitted and transformed
# The test data folds will be transformed

from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

num_attribs = ['A2', 'A3','A8', 'A11', 'A14', 'A15']
cat_attribs = ["A1","A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"] 

preprocessor1 = ColumnTransformer([
    ("num", num_pipeline, num_attribs), 
    ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)])

preprocessor2 = ColumnTransformer([("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)],
                                  remainder = 'passthrough')
In [88]:
# Creating list of classifier models
models = [('SGD',SGDClassifier(random_state=42)),
          ('LR',LogisticRegression(max_iter = 1000,random_state = 48)),
          ('RF',RandomForestClassifier(max_depth=2, random_state=42)),
          ('KNN',KNeighborsClassifier(n_neighbors = 7)),
          ('NB',GaussianNB()),
          ('SVC',LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)),
          ('CART',DecisionTreeClassifier(max_depth = 7,random_state = 48)),
          ('GBM',GradientBoostingClassifier(n_estimators=50, max_depth=10)),
          ('LDA',LinearDiscriminantAnalysis())
         ]
In [89]:
# Creating the pipeline of model and the preprocessor for feeding into the k-fold crossvalidation 
pipelines_list = []
for i in models:
    if i[0] not in ['RF','CART', 'GBM']:
        pipelines_list.append(('Scaled'+str(i[0]), Pipeline([('Preprocessor1', preprocessor1),i])))
    else:
        pipelines_list.append((str(i[0]), Pipeline([('Preprocessor2', preprocessor2),i])))
In [90]:
# Checking the pipeline
pipelines_list
Out[90]:
[('ScaledSGD',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('SGD', SGDClassifier(random_state=42))])),
 ('ScaledLR',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('LR', LogisticRegression(max_iter=1000, random_state=48))])),
 ('RF',
  Pipeline(steps=[('Preprocessor2',
                   ColumnTransformer(remainder='passthrough',
                                     transformers=[('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('RF', RandomForestClassifier(max_depth=2, random_state=42))])),
 ('ScaledKNN',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('KNN', KNeighborsClassifier(n_neighbors=7))])),
 ('ScaledNB',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('NB', GaussianNB())])),
 ('ScaledSVC',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('SVC',
                   LinearSVC(C=0.1, class_weight='balanced', max_iter=10000,
                             verbose=False))])),
 ('CART',
  Pipeline(steps=[('Preprocessor2',
                   ColumnTransformer(remainder='passthrough',
                                     transformers=[('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('CART', DecisionTreeClassifier(max_depth=7, random_state=48))])),
 ('GBM',
  Pipeline(steps=[('Preprocessor2',
                   ColumnTransformer(remainder='passthrough',
                                     transformers=[('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('GBM',
                   GradientBoostingClassifier(max_depth=10, n_estimators=50))])),
 ('ScaledLDA',
  Pipeline(steps=[('Preprocessor1',
                   ColumnTransformer(transformers=[('num',
                                                    Pipeline(steps=[('std_scaler',
                                                                     StandardScaler())]),
                                                    ['A2', 'A3', 'A8', 'A11',
                                                     'A14', 'A15']),
                                                   ('cat',
                                                    OneHotEncoder(handle_unknown='ignore'),
                                                    ['A1', 'A4', 'A5', 'A6', 'A7',
                                                     'A9', 'A10', 'A12',
                                                     'A13'])])),
                  ('LDA', LinearDiscriminantAnalysis())]))]
In [91]:
# Evaluating the Algorithms
results = []
names = []

st_time = time.time()
for name, pipeline in pipelines_list:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle = True)
    cv_results = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring=scoring) 
    print(type(pipeline))
    results.append(cv_results)
    names.append(name)
    msg = "{:<10}: {:<6} ({:^6})".format(name, cv_results.mean().round(4), cv_results.std().round(4))
    print(msg)
    print(X_train)
    
en_time = time.time()
print('Total time: {:.4f}s'.format(en_time-st_time))
<class 'sklearn.pipeline.Pipeline'>
ScaledSGD : 0.8119 (0.0897)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
ScaledLR  : 0.8569 (0.0481)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
RF        : 0.8589 (0.0494)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
ScaledKNN : 0.8406 (0.0622)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
ScaledNB  : 0.6685 (0.0645)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
ScaledSVC : 0.8529 (0.0424)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
CART      : 0.8283 (0.0387)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
GBM       : 0.8017 (0.0538)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
<class 'sklearn.pipeline.Pipeline'>
ScaledLDA : 0.8529 (0.0433)
    A1     A2      A3 A4 A5  A6  A7     A8 A9 A10  A11 A12 A13    A14     A15
7    a  22.92  11.585  u  g  cc   v  0.040  t   f  0.0   f   g   80.0  1349.0
64   b  26.67   4.250  u  g  cc   v  4.290  t   t  1.0   t   g  120.0     0.0
320  b  21.25   1.500  u  g   w   v  1.500  f   f  0.0   f   g  150.0     8.0
358  b  32.42   3.000  u  g   d   v  0.165  f   f  0.0   t   g  120.0     0.0
628  b  29.25  13.000  u  g   d   h  0.500  f   f  0.0   f   g  228.0     0.0
..  ..    ...     ... .. ..  ..  ..    ... ..  ..  ...  ..  ..    ...     ...
621  b  22.67   0.165  u  g   c   j  2.250  f   f  0.0   t   s    0.0     0.0
521  a  30.00   5.290  u  g   e  dd  2.250  t   t  5.0   t   g   99.0   500.0
156  a  28.50   3.040  y  p   x   h  2.540  t   t  1.0   f   g   70.0     0.0
509  b  21.00   4.790  y  p   w   v  2.250  t   t  1.0   t   g   80.0   300.0
119  a  20.75  10.335  u  g  cc   h  0.335  t   t  1.0   t   g   80.0    50.0

[489 rows x 15 columns]
Total time: 4.1834s
In [92]:
tmp = pd.DataFrame(results).transpose()
tmp.columns = names
tmp
Out[92]:
ScaledSGD ScaledLR RF ScaledKNN ScaledNB ScaledSVC CART GBM ScaledLDA
0 0.8571 0.9184 0.9388 0.9184 0.6735 0.9184 0.7959 0.8980 0.9184
1 0.6735 0.8367 0.7755 0.7143 0.6122 0.7959 0.8163 0.7959 0.8163
2 0.8367 0.7959 0.8163 0.8163 0.7347 0.8163 0.7347 0.6939 0.8367
3 0.8980 0.8980 0.9184 0.9184 0.6327 0.8776 0.8367 0.8163 0.8776
4 0.9184 0.8776 0.8980 0.8367 0.7959 0.8571 0.8571 0.7755 0.8571
5 0.8776 0.8367 0.8163 0.8367 0.7143 0.8367 0.8571 0.8367 0.8163
6 0.6327 0.7959 0.8163 0.7959 0.6122 0.8367 0.8776 0.8163 0.8367
7 0.7755 0.9184 0.8776 0.8571 0.6735 0.8776 0.8163 0.8163 0.8776
8 0.7959 0.7959 0.8571 0.7959 0.6735 0.7959 0.8367 0.7347 0.7755
9 0.8542 0.8958 0.8750 0.9167 0.5625 0.9167 0.8542 0.8333 0.9167
In [93]:
tmp.mean()
Out[93]:
ScaledSGD    0.8119
ScaledLR     0.8569
RF           0.8589
ScaledKNN    0.8406
ScaledNB     0.6685
ScaledSVC    0.8529
CART         0.8283
GBM          0.8017
ScaledLDA    0.8529
dtype: float64
In [94]:
print('The top 4 algorithms based on crossvalidation performance are :')

for alg, value in tmp.mean().sort_values(ascending = False)[0:4].iteritems():
    print('{: <20} : {: 1.4f}'.format(alg, value))
The top 4 algorithms based on crossvalidation performance are :
RF                   :  0.8589
ScaledLR             :  0.8569
ScaledSVC            :  0.8529
ScaledLDA            :  0.8529

Comparing the mean and standard deviation of performance measures

In [95]:
# from cross validation using various classifier models
tmp1 = pd.concat([tmp.mean(), tmp.std()], axis = 1, keys = ['mean','std_dev'])
tmp1.style.background_gradient(cmap = 'Blues')
Out[95]:
mean std_dev
ScaledSGD 0.8119 0.0945
ScaledLR 0.8569 0.0507
RF 0.8589 0.0521
ScaledKNN 0.8406 0.0656
ScaledNB 0.6685 0.0680
ScaledSVC 0.8529 0.0446
CART 0.8283 0.0408
GBM 0.8017 0.0567
ScaledLDA 0.8529 0.0457
In [96]:
tmp1['mean'].idxmax(axis = 'columns')
Out[96]:
'RF'
In [97]:
np.argsort(tmp1['mean'])
Out[97]:
ScaledSGD    4
ScaledLR     7
RF           0
ScaledKNN    6
ScaledNB     3
ScaledSVC    8
CART         5
GBM          1
ScaledLDA    2
Name: mean, dtype: int64
In [98]:
# Understanding of np.argsort or df['col'].argsort()
n = 2
algorithm_index = tmp1['mean'].index.to_list()
top_n_idx = tmp1['mean'].argsort()[::-1][:n].values

top2_algorithms = [algorithm_index[i] for i in top_n_idx]
top2_algorithms
Out[98]:
['RF', 'ScaledLR']
In [99]:
top_n_idx
Out[99]:
array([2, 1])
In [100]:
n = 2
tmp1['mean'].argsort()[::-1][n]
Out[100]:
5
In [101]:
n = 2
avgDists = np.array([1, 8, 6, 9, 4])
ids = avgDists.argsort()
ids
Out[101]:
array([0, 4, 2, 1, 3])
In [102]:
type(tmp1['mean'].argsort()[0])
Out[102]:
numpy.int64
In [103]:
# Compare Algorithms
plt.rcdefaults()
fig = plt.figure(figsize = (6,3)) 
ax = fig.add_subplot(111) 
sns.boxplot(data = tmp, color = 'lightgrey', linewidth = 1, width = 0.5, orient = 'h')

# Coloring box-plots of top 2 mean values
n = 2
algorithm_index = tmp1['mean'].index.to_list()
top_2_idx = tmp1['mean'].argsort()[::-1][:n].values

for i in top_2_idx :
    # Select which box you want to change    
    mybox = ax.patches[i]

    # Change the appearance of that box
    mybox.set_facecolor('salmon')
    mybox.set_alpha(0.8)
    # mybox.set_edgecolor('black')
    # mybox.set_linewidth(3)

# Coloring box-plots of 3rd and 4th mean values
top_3_4_idx = tmp1['mean'].argsort()[::-1][2:4].values
for i in top_3_4_idx :
    # Select which box you want to change    
    mybox = ax.patches[i]

    # Change the appearance of that box
    mybox.set_facecolor('mediumturquoise')
    mybox.set_alpha(0.7)
    # mybox.set_edgecolor('black')
    # mybox.set_linewidth(3)

ax.grid(True, alpha = 0.4, ls = '--')
ax.set_axisbelow(True)
[labels.set(rotation = 20, ha = 'right') for labels in ax.get_xticklabels()]
[labels.set(size = 8) for labels in ax.get_yticklabels()]

for key, _  in ax.spines._dict.items():
    ax.spines._dict[key].set_linewidth(.5)

ax.set_title('Algorithm Comparison using 10-fold CV scores', ha = 'center' )
ax.set_xlabel('CV score')
#ax.set_ylim(0.6,1)
plt.show()
fig.savefig(path+'spotchecking algorithms using 10 fold CV.png', dpi = 175)

Results from simple Train Test split

In [104]:
display_results(dict_perf).style.background_gradient(cmap='Blues')
Out[104]:
Accuracy F1 Recall Precision
Train Test Train Test Train Test Train Test
SGD Classifier 0.8528 0.8049 0.8531 0.8053 0.8528 0.8049 0.8568 0.8117
Logistic Regression Classifier 0.8855 0.8902 0.8857 0.8905 0.8855 0.8902 0.8878 0.8974
Random Forest Classifier 0.9141 0.8902 0.9142 0.8904 0.9141 0.8902 0.9144 0.8917
k-Nearest Neighbor Classifier 0.8712 0.8720 0.8707 0.8717 0.8712 0.8720 0.8719 0.8720
Gaussian Naive Bayes Classifier 0.8487 0.8354 0.8475 0.8335 0.8487 0.8354 0.8512 0.8395
Support Vector Classifier 0.8650 0.8841 0.8652 0.8842 0.8650 0.8841 0.8738 0.8992
Decision Tree Classifier 0.9182 0.8354 0.9181 0.8347 0.9182 0.8354 0.9184 0.8356
Gradient Boosting Classifier 0.9939 0.8780 0.9939 0.8783 0.9939 0.8780 0.9939 0.8795
Linear Discriminant Analysis Classifier 0.8691 0.8780 0.8693 0.8780 0.8691 0.8780 0.8789 0.8949

Selecting the Algorithms from K-Fold cross validation

From both simple train/test/split and cross validation methods, we have shortlisted the below two classifiers with the highest accuracy measures :
-- Logistic Regression Classifier and
-- Random Forest Classifier

Tuning the Selected Algorithm

Tuning Logistic Regression Classifier

In [105]:
# https://stackoverflow.com/questions/62331674/sklearn-combine-gridsearchcv-with-column-transform-and-pipeline?noredirect=1&lq=1

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
import pandas as pd  # doctest: +SKIP

# define dataset
X, y = X_train, y_train

from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, RepeatedStratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


#numerical_features=make_column_selector(dtype_include=np.number)
#cat_features=make_column_selector(dtype_exclude=np.number)

num_attribs = ['A2', 'A3','A8', 'A11', 'A14', 'A15']
cat_attribs = ["A1","A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"] 

# Setting the pipeline for preprocessing numeric and categorical variables
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_attribs), 
    ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)])

# Creating a composite estimator by appending classifier estimator to preprocessor pipeline
model_lr = make_pipeline(preprocessor,
                         LogisticRegression(random_state = 48, max_iter = 1000) )

# define models and parameters
#model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]

# define grid search
grid_lr = dict(logisticregression__solver=solvers,
            logisticregression__penalty=penalty,
            logisticregression__C=c_values)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Instantiating GridSearchCV object with composite estimator, parameter grid, CV(crossvalidation generator)
grid_search_lr = GridSearchCV(estimator=model_lr, param_grid=grid_lr, 
                              n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)

grid_result_lr = grid_search_lr.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result_lr.best_score_, grid_result_lr.best_params_))
means = grid_result_lr.cv_results_['mean_test_score']
stds = grid_result_lr.cv_results_['std_test_score']
params = grid_result_lr.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Best: 0.860913 using {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
0.847917 (0.054990) with: {'logisticregression__C': 100, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'newton-cg'}
0.847917 (0.054990) with: {'logisticregression__C': 100, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}
0.847917 (0.054990) with: {'logisticregression__C': 100, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
0.853387 (0.049407) with: {'logisticregression__C': 10, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'newton-cg'}
0.854082 (0.049850) with: {'logisticregression__C': 10, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}
0.853387 (0.049407) with: {'logisticregression__C': 10, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
0.852693 (0.046922) with: {'logisticregression__C': 1.0, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'newton-cg'}
0.852693 (0.046922) with: {'logisticregression__C': 1.0, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}
0.852693 (0.046922) with: {'logisticregression__C': 1.0, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
0.858844 (0.047125) with: {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'newton-cg'}
0.858844 (0.047125) with: {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}
0.860913 (0.047333) with: {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
0.837032 (0.054992) with: {'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'newton-cg'}
0.837032 (0.054992) with: {'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'lbfgs'}
0.840434 (0.051608) with: {'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
In [106]:
set_config(display='text')
grid_result_lr.best_estimator_
Out[106]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('std_scaler',
                                                                   StandardScaler())]),
                                                  ['A2', 'A3', 'A8', 'A11',
                                                   'A14', 'A15']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['A1', 'A4', 'A5', 'A6', 'A7',
                                                   'A9', 'A10', 'A12',
                                                   'A13'])])),
                ('logisticregression',
                 LogisticRegression(C=0.1, max_iter=1000, random_state=48,
                                    solver='liblinear'))])
In [107]:
set_config(display='diagram')
grid_search_lr.best_estimator_
Out[107]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('std_scaler',
                                                                   StandardScaler())]),
                                                  ['A2', 'A3', 'A8', 'A11',
                                                   'A14', 'A15']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['A1', 'A4', 'A5', 'A6', 'A7',
                                                   'A9', 'A10', 'A12',
                                                   'A13'])])),
                ('logisticregression',
                 LogisticRegression(C=0.1, max_iter=1000, random_state=48,
                                    solver='liblinear'))])
Please rerun this cell to show the HTML repr or trust the notebook.

Tuning RandomForest Classifier

In [108]:
# https://stackoverflow.com/questions/62331674/sklearn-combine-gridsearchcv-with-column-transform-and-pipeline?noredirect=1&lq=1

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
import pandas as pd  # doctest: +SKIP

# define dataset
X, y = X_train, y_train

from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


#numerical_features=make_column_selector(dtype_include=np.number)
#cat_features=make_column_selector(dtype_exclude=np.number)

num_attribs = ['A2', 'A3','A8', 'A11', 'A14', 'A15']
cat_attribs = ["A1","A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"] 

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_attribs), 
    ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs)])

model_rf = make_pipeline(preprocessor, RandomForestClassifier())

# define model parameters
n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']

# define grid search
grid_rf = dict(randomforestclassifier__n_estimators=n_estimators,
            randomforestclassifier__max_features=max_features
           )

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search_rf = GridSearchCV(estimator=model_rf, param_grid=grid_rf, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result_rf = grid_search_rf.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result_rf.best_score_, grid_result_rf.best_params_))
Best: 0.872562 using {'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__n_estimators': 100}
In [109]:
means = grid_result_rf.cv_results_['mean_test_score']
stds = grid_result_rf.cv_results_['std_test_score']
params = grid_result_rf.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
0.858149 (0.058613) with: {'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__n_estimators': 10}
0.872562 (0.044866) with: {'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__n_estimators': 100}
0.871840 (0.045962) with: {'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__n_estimators': 1000}
0.850680 (0.045532) with: {'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__n_estimators': 10}
0.869813 (0.046994) with: {'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__n_estimators': 100}
0.871173 (0.043511) with: {'randomforestclassifier__max_features': 'log2', 'randomforestclassifier__n_estimators': 1000}

Training the final Model on full training set

Since, refit = True in grid_search_lr (The GridSearchCV for logistic Regression),
the best estimator - grid_result_lr.bestestimator is the final model that is trained on the entire model.
We will use the trained model to make predictions and assess performance on both training and test sets.

In [110]:
# Inspecting the hyperparameters of the tuned estimator
In [111]:
set_config(display='text')
grid_result_lr.best_estimator_
Out[111]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('std_scaler',
                                                                   StandardScaler())]),
                                                  ['A2', 'A3', 'A8', 'A11',
                                                   'A14', 'A15']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['A1', 'A4', 'A5', 'A6', 'A7',
                                                   'A9', 'A10', 'A12',
                                                   'A13'])])),
                ('logisticregression',
                 LogisticRegression(C=0.1, max_iter=1000, random_state=48,
                                    solver='liblinear'))])
In [112]:
final_estimator = grid_result_lr.best_estimator_.named_steps['logisticregression']
print(final_estimator)
print('\n')
print('Coefficients of Logistic Regression Model : \n {}'.format(final_estimator.coef_))
LogisticRegression(C=0.1, max_iter=1000, random_state=48, solver='liblinear')


Coefficients of Logistic Regression Model : 
 [[-6.58541123e-03 -3.51822076e-03  3.09711492e-01  4.70027755e-01
  -1.13323153e-01  4.35822579e-01  1.68285927e-02 -8.75341871e-02
   7.47808800e-02  7.28776727e-02 -2.18364147e-01  7.28776727e-02
   7.47808800e-02 -2.18364147e-01 -1.39144167e-01  4.01314072e-02
   4.11012844e-01  8.55332882e-03 -2.12058549e-02 -3.27693790e-01
  -1.76320240e-01 -2.68447620e-02 -2.30368098e-01 -2.52030447e-02
   4.08624678e-02  5.64167500e-03 -3.60899378e-02  4.05962577e-01
   6.25902716e-02  3.63074013e-02 -2.61934006e-01  1.02733304e-01
   5.97703732e-02  8.88998116e-02  8.68297921e-04 -7.75378510e-02
  -8.24031970e-02 -1.21443647e+00  1.14373088e+00 -3.02496249e-01
   2.31790655e-01  1.00373967e-02 -8.07429912e-02 -1.80524180e-02
  -1.20565649e-02 -4.05966115e-02]]

Testing the model on entire train set

In [113]:
final_model = grid_result_lr.best_estimator_

X_train_prepared = final_model['columntransformer'].transform(X_train)
train_predictions = final_model.named_steps['logisticregression'].predict(X_train_prepared)
In [114]:
# Testing the final model on test set
X_test, y_test = X_test, y_test
X_test_prepared = final_model['columntransformer'].transform(X_test)
final_predictions = final_model['logisticregression'].predict(X_test_prepared)

evaluation_parametrics(y_train,train_predictions,y_test,final_predictions)

cm = confusion_matrix(y_test,final_predictions, labels= final_model['logisticregression'].classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels= final_model['logisticregression'].classes_)

sns.set(font_scale=1.5) # Adjust to fit 
disp.plot()
plt.gca().grid(False)
plt.show()
In [115]:
# Inspecting the pipeline objects
final_model.steps
Out[115]:
[('columntransformer',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('std_scaler',
                                                    StandardScaler())]),
                                   ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']),
                                  ('cat', OneHotEncoder(handle_unknown='ignore'),
                                   ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10',
                                    'A12', 'A13'])])),
 ('logisticregression',
  LogisticRegression(C=0.1, max_iter=1000, random_state=48, solver='liblinear'))]
In [116]:
final_model['columntransformer'].named_transformers_['num'].get_params()
#https://stackoverflow.com/questions/67374844/how-to-find-out-standardscaling-parameters-mean-and-scale-when-using-column
Out[116]:
{'memory': None,
 'steps': [('std_scaler', StandardScaler())],
 'verbose': False,
 'std_scaler': StandardScaler(),
 'std_scaler__copy': True,
 'std_scaler__with_mean': True,
 'std_scaler__with_std': True}
In [117]:
final_model['columntransformer'].named_transformers_['num'].named_steps['std_scaler'].__getstate__()
Out[117]:
{'with_mean': True,
 'with_std': True,
 'copy': True,
 'feature_names_in_': array(['A2', 'A3', 'A8', 'A11', 'A14', 'A15'], dtype=object),
 'n_features_in_': 6,
 'n_samples_seen_': 489,
 'mean_': array([  31.64194274,    4.80782209,    2.16182004,    2.62167689,
         176.64212679, 1061.93251534]),
 'var_': array([1.44527358e+02, 2.44550009e+01, 1.08188797e+01, 2.72208798e+01,
        2.82397022e+04, 3.48843393e+07]),
 'scale_': array([1.20219532e+01, 4.94519979e+00, 3.28920655e+00, 5.21736330e+00,
        1.68046726e+02, 5.90629658e+03]),
 '_sklearn_version': '1.0.2'}

Save model for later use

In [235]:
import joblib
joblib.dump(final_model, 'credit-screening-lr.pkl')
Out[235]:
['credit-screening-lr.pkl']
In [236]:
clf = joblib.load('credit-screening-lr.pkl')
clf
Out[236]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('std_scaler',
                                                                   StandardScaler())]),
                                                  ['A2', 'A3', 'A8', 'A11',
                                                   'A14', 'A15']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['A1', 'A4', 'A5', 'A6', 'A7',
                                                   'A9', 'A10', 'A12',
                                                   'A13'])])),
                ('logisticregression',
                 LogisticRegression(C=0.1, max_iter=1000, random_state=48,
                                    solver='liblinear'))])
In [237]:
print(list(dir(clf.named_steps['logisticregression'])))
['C', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_check_feature_names', '_check_n_features', '_estimator_type', '_get_param_names', '_get_tags', '_more_tags', '_predict_proba_lr', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_validate_data', 'class_weight', 'classes_', 'coef_', 'decision_function', 'densify', 'dual', 'fit', 'fit_intercept', 'get_params', 'intercept_', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_features_in_', 'n_iter_', 'n_jobs', 'penalty', 'predict', 'predict_log_proba', 'predict_proba', 'random_state', 'score', 'set_params', 'solver', 'sparsify', 'tol', 'verbose', 'warm_start']
In [238]:
clf.named_steps['logisticregression'].coef_
Out[238]:
array([[-6.58541123e-03, -3.51822076e-03,  3.09711492e-01,
         4.70027755e-01, -1.13323153e-01,  4.35822579e-01,
         1.68285927e-02, -8.75341871e-02,  7.47808800e-02,
         7.28776727e-02, -2.18364147e-01,  7.28776727e-02,
         7.47808800e-02, -2.18364147e-01, -1.39144167e-01,
         4.01314072e-02,  4.11012844e-01,  8.55332882e-03,
        -2.12058549e-02, -3.27693790e-01, -1.76320240e-01,
        -2.68447620e-02, -2.30368098e-01, -2.52030447e-02,
         4.08624678e-02,  5.64167500e-03, -3.60899378e-02,
         4.05962577e-01,  6.25902716e-02,  3.63074013e-02,
        -2.61934006e-01,  1.02733304e-01,  5.97703732e-02,
         8.88998116e-02,  8.68297921e-04, -7.75378510e-02,
        -8.24031970e-02, -1.21443647e+00,  1.14373088e+00,
        -3.02496249e-01,  2.31790655e-01,  1.00373967e-02,
        -8.07429912e-02, -1.80524180e-02, -1.20565649e-02,
        -4.05966115e-02]])