This dataset was downloaded from the UC Irvine Machine Learning Repository.
It concerns the evaluation of cars.
The target variable/label is car acceptability and has four categories: unacceptable, acceptable, good and very good.
The input attributes fall under two broad categories: price and technical characteristics.
We have identified that this is an imbalanced dataset, with skewed class (output category/label) proportions.
The objective here is to build a multiclass classification model that predicts car acceptability from the input attributes.
Summary of Key information
Number of Instances/training examples : 1728
Number of Instances with missing attributes : 0
Number of qualified Instances/training examples : 1728 (no instances are disqualified, as none have missing attributes)
Number of Input Attributes : 6
Number of categorical attributes : 6
Number of numerical attributes : 0
Target Attribute Type : Multi class label
Target Class distribution : 70% (unacc) : 22% (acc) : 3.9% (good) : 3.7% (vgood)
Problem Identification : Multiclass classification with an imbalanced dataset
# Data Wrangling, inspection
import numpy as np
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt
# Data preprocessing
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
# sklearn ml models
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
# Evaluation metrics
# Note: plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, use ConfusionMatrixDisplay.from_estimator instead.
from sklearn.metrics import recall_score, precision_score, \
    accuracy_score, plot_confusion_matrix, classification_report, f1_score
pathname = "/Users/bhaskarroy/BHASKAR FILES/BHASKAR CAREER/Career/Skills/Data Science/Practise/Python/UCI Machine Learning Repository/car"
path0 = "/car.c45-names"
path1 = "/car.data"
path2 = "/car.names"
pathdata = pathname + path1
pathcolname = pathname + path0
pathdatadesc = pathname + path2
with open(pathdatadesc) as f:
print(f.read())
with open(pathcolname) as f:
print(f.read())
We will prepare the data for analysis. The following actions were undertaken: reading the data file into a DataFrame with explicit column names, inspecting it for missing values, and casting each column to an ordered categorical.
colnames = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data = pd.read_csv(pathdata, names = colnames, index_col = False)
data
data.info()
data.describe()
data.dtypes
# Inspect if any missing values present
data.isnull().sum()
for i in data.columns :
print(f'{i} : {data[i].unique().tolist()}')
cat_vars = data.columns
catdict = {
"buying": ['low','med','high', 'vhigh' ],
"maint": ['low','med','high', 'vhigh' ],
"doors": ['2', '3', '4', '5more'],
"persons" : ['2', '4', 'more'],
"lug_boot" : ['small', 'med', 'big'],
"safety" : ['low', 'med', 'high'],
"class":['unacc', 'acc','good', 'v-good']
}
for i in cat_vars :
data[i] = pd.Categorical(data[i],
categories=catdict[i], ordered=True)
data.dtypes
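As a quick check that the declared ordering took effect, ordered categoricals support order-aware comparisons; a minimal sketch:
# The category order follows catdict, so element-wise comparisons respect it.
print(data['safety'].cat.categories)   # Index(['low', 'med', 'high'], dtype='object')
print((data['safety'] > 'low').sum())  # number of rows rated 'med' or 'high'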
def show(data):
for i in data.columns[0:]:
print("Feature: {} with {} Levels".format(i,data[i].unique()))
show(data)
data.isnull().sum()
#Accessing colors from external library Palettable
from palettable.cartocolors.qualitative import Bold_10
colors = Bold_10.mpl_colors
#colors = plt.cm.Dark2(range(15))
#colors = plt.cm.tab20(range(15))
#.colors attribute for listed colormaps
#colors = plt.cm.tab10.colors
#colors = plt.cm.Paired.colors
# custom function for easy and efficient univariate analysis of categorical variables
def UVA_category(data_frame, var_group = [], **kargs):
'''
Stands for Univariate_Analysis_categorical.
Takes a group of categorical variables and plots/prints the value counts and a horizontal barplot for each.
- data_frame : The DataFrame
- var_group : The list of column names for which univariate plots are to be plotted
The keyword arguments are as follows :
- colcount : The number of columns in the plot layout. Default value is 2.
For instance, if there are 4 columns in var_group, then 4 univariate plots will be plotted in a 2x2 layout.
- colwidth : width of each plot
- rowheight : height of each plot
- normalize : Whether to present absolute values or percentage
- sort_by : Whether to sort the bars by descending order of values
- axlabel_fntsize : Fontsize of x axis and y axis labels
- infofntsize : Fontsize in the info of unique value counts
- axticklabel_fntsize : fontsize of axis tick labels
- infofntfamily : Font family of the info of unique value counts.
Choose a monospace font family for multiline alignment.
Some choices are : 'Consolas', 'Courier','Courier New', 'Lucida Sans Typewriter','Lucidatypewriter','Andale Mono'
https://www.tutorialbrain.com/css_tutorial/css_font_family_list/
- max_val_counts : Number of unique values for which count should be displayed
- nspaces : Length of each line for the multiline strings in the info area for value_counts
- ncountspaces : Length allocated to the count value for the unique values in the info area
- show_percentage : Whether to show percentage of total for each unique value count
Also check link for formatting syntax : https://pyformat.info/#number
'''
import textwrap
data = data_frame.copy(deep = True)
# Using a dictionary with default values of keyword arguments
params_plot = dict(colcount = 2, colwidth = 7, rowheight = 4, normalize = False, sort_by = "Values")
params_fontsize = dict(axlabel_fntsize = 10,axticklabel_fntsize = 8, infofntsize = 10)
params_fontfamily = dict(infofntfamily = 'Andale Mono')
params_max_val_counts = dict(max_val_counts = 10)
params_infospaces = dict(nspaces = 10, ncountspaces = 4)
params_show_percentage = dict(show_percentage = True)
# Updating the dictionary with parameter values passed while calling the function
params_plot.update((k, v) for k, v in kargs.items() if k in params_plot)
params_fontsize.update((k, v) for k, v in kargs.items() if k in params_fontsize)
params_fontfamily.update((k, v) for k, v in kargs.items() if k in params_fontfamily)
params_max_val_counts.update((k, v) for k, v in kargs.items() if k in params_max_val_counts)
params_infospaces.update((k, v) for k, v in kargs.items() if k in params_infospaces)
params_show_percentage.update((k, v) for k, v in kargs.items() if k in params_show_percentage)
#params = dict(**params_plot, **params_fontsize)
# Initialising all the possible keyword arguments of doc string with updated values
colcount = params_plot['colcount']
colwidth = params_plot['colwidth']  # fixed: 'colheight' is not a defined default
rowheight = params_plot['rowheight']
normalize = params_plot['normalize']
sort_by = params_plot['sort_by']
axlabel_fntsize = params_fontsize['axlabel_fntsize']
axticklabel_fntsize = params_fontsize['axticklabel_fntsize']
infofntsize = params_fontsize['infofntsize']
infofntfamily = params_fontfamily['infofntfamily']
max_val_counts = params_max_val_counts['max_val_counts']
nspaces = params_infospaces['nspaces']
ncountspaces = params_infospaces['ncountspaces']
show_percentage = params_show_percentage['show_percentage']
if len(var_group) == 0:
var_group = data.select_dtypes(exclude = ['number']).columns.to_list()  # fixed: `df` was undefined here
import matplotlib.pyplot as plt
plt.rcdefaults()
# setting figure_size
size = len(var_group)
#rowcount = 1
#colcount = size//rowcount+(size%rowcount != 0)*1
#print(colcount)
rowcount = size//colcount+(size%colcount != 0)*1
plt.figure(figsize = (colwidth*colcount,rowheight*rowcount), dpi = 150)
# Converting the filtered columns as categorical
for i in var_group:
#data[i] = data[i].astype('category')
data[i] = pd.Categorical(data[i])
# for every variable
for j,i in enumerate(var_group):
#print('{} : {}'.format(j,i))
norm_count = data[i].value_counts(normalize = normalize).sort_index()
n_uni = data[i].nunique()
if sort_by == "Values":
norm_count = data[i].value_counts(normalize = normalize).sort_values(ascending = False)
n_uni = data[i].nunique()
#Plotting the variable with every information
plt.subplot(rowcount,colcount,j+1)
sns.barplot(x = norm_count, y = norm_count.index , order = norm_count.index)
if normalize == False :
plt.xlabel('count', fontsize = axlabel_fntsize )
else :
plt.xlabel('fraction/percent', fontsize = axlabel_fntsize )
plt.ylabel('{}'.format(i), fontsize = axlabel_fntsize )
ax = plt.gca()
# textwrapping
ax.set_yticklabels([textwrap.fill(str(e), 20) for e in norm_count.index], fontsize = axticklabel_fntsize)
#print(n_uni)
#print(type(norm_count.round(2)))
# Functions to convert the pairing of unique values and value_counts into text string
# Function to break a word into multiline string of fixed width per line
def paddingString(word, nspaces = 20):
i = len(word)//nspaces \
+(len(word)%nspaces > 0)*(len(word)//nspaces > 0)*1 \
+ (len(word)//nspaces == 0)*1
strA = ""
for j in range(i-1):
strA = strA+'\n'*(len(strA)>0)+ word[j*nspaces:(j+1)*nspaces]
# insert appropriate number of white spaces
strA = strA + '\n'*(len(strA)>0)*(i>1)+word[(i-1)*nspaces:] \
+ " "*(nspaces-len(word)%nspaces)*(len(word)%nspaces > 0)
return strA
# Function to convert Pandas series into multi line strings
def create_string_for_plot(ser, nspaces = nspaces, ncountspaces = ncountspaces, \
show_percentage = show_percentage):
'''
- nspaces : Length of each line for the multiline strings in the info area for value_counts
- ncountspaces : Length allocated to the count value for the unique values in the info area
- show_percentage : Whether to show percentage of total for each unique value count
Also check link for formatting syntax : https://pyformat.info/#number
'''
str_text = ""
for index, value in ser.items():
str_tmp = paddingString(str(index), nspaces)+ " : " \
+ " "*(ncountspaces-len(str(value)))*(len(str(value))<= ncountspaces) \
+ str(value) \
+ (" | " + "{:4.1f}%".format(value/ser.sum()*100))*show_percentage
str_text = str_text + '\n'*(len(str_text)>0) + str_tmp
return str_text
#print(create_string_for_plot(norm_count.round(2)))
#Ensuring a maximum of max_val_counts unique values are displayed
if norm_count.round(2).size <= max_val_counts:
text = '{}\nn_uniques = {}\nvalue counts\n{}' \
.format(i, n_uni,create_string_for_plot(norm_count.round(2)))
ax.annotate(text = text,
xy = (1.1, 1), xycoords = ax.transAxes,
ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)
else :
text = '{}\nn_uniques = {}\nvalue counts of top {}\n{}' \
.format(i, n_uni, max_val_counts, create_string_for_plot(norm_count.round(2)[0:max_val_counts]))
ax.annotate(text = text,
xy = (1.1, 1), xycoords = ax.transAxes,
ha = 'left', va = 'top', fontsize = infofntsize, fontfamily = infofntfamily)
plt.gcf().tight_layout()
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
rowheight = 3, normalize = False);
from eda import eda_overview
eda_overview.UVA_category(data, data.columns,
rowheight = 3, normalize = True);
from eda import composite_plots
composite_plots.bar_counts(data, data.columns)
For the car to be acceptable, it has to be low in at least one of the pricing parameters - maintenance or buying price.
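A quick sanity check of this claim (a sketch using the `data` frame from above):
# If the claim holds, rows where neither pricing attribute is 'low'
# should be dominated by the 'unacc' class.
mask = (data['buying'] != 'low') & (data['maint'] != 'low')
print(data.loc[mask, 'class'].value_counts())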
dv = "buying"
df = pd.crosstab([data[dv]],[data['class']])
df.head()
df.plot(kind='bar', stacked = True, title=dv )
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.ylabel('Class Frequency')
data.columns
from eda.composite_plots import features_plots
features_plots(data, ['buying', 'maint'], 'class')
pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False).transpose()
data.pivot_table(index=['buying','maint'], aggfunc='size')
df3 = data.groupby(["buying", "maint"]).size().reset_index(name="value_count")
sns.barplot(x = 'buying', y = 'value_count', hue = 'maint', data = df3)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Attribute concept hierarchy (from car.names) :
CAR : car acceptability
PRICE : overall price (buying, maint)
TECH : technical characteristics (doors, persons, lug_boot, safety)
safety : estimated safety of the car
# Random inspection - price vs class
pd.set_option('display.max_rows', 20)
df4 = pd.DataFrame({'count' : data.groupby(by = ['buying','maint','class']).size()}).reset_index()
df4.style.background_gradient(cmap='Blues')
#https://stackoverflow.com/questions/12286607/making-heatmap-from-pandas-dataframe
df4
#pd.DataFrame(data[['buying','maint','class']].value_counts())
Price has two components - buying price and maintenance.
pd.crosstab(index = [data['buying']], columns = data['class'], margins = False)
from eda.composite_plots import heatmap_plot
heatmap_plot(data, index = ['buying'], column = ['class'])
pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)
df_MainVsClass = pd.crosstab(index = [data['maint']], columns = data['class'], margins = False)
sns.heatmap(df_MainVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12)
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
df_BuyingVsMaintenance = pd.crosstab(index = [data['buying']], columns = data['maint'], margins = False)
df_BuyingVsMaintenance
fig, ax = plt.subplots(figsize=(4,3))
#font_kwds = dict(fontsize = 12)
sns.heatmap(df_BuyingVsMaintenance)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 12)
plt.xlabel(ax.get_xlabel(), fontsize = 15, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 15, fontweight = 'heavy')
df_PricingVsClass = pd.crosstab(index = [data['buying'],data['maint']], columns = data['class'], margins = False)
df_PricingVsClass
fig, ax = plt.subplots(figsize=(5,4))
#font_kwds = dict(fontsize = 12)
sns.heatmap(df_PricingVsClass)
ax.tick_params(axis = 'both', labelcolor = 'black', labelsize = 10)
plt.xlabel(ax.get_xlabel(), fontsize = 8, fontweight = 'heavy')
plt.ylabel(ax.get_ylabel(), fontsize = 8, fontweight = 'heavy')
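For readability, the cell counts can be annotated directly on the heatmap; a minimal variant of the plot above:
# Same crosstab, with integer counts annotated in each cell (fmt='d').
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(df_PricingVsClass, annot=True, fmt='d', cmap='Blues', ax=ax)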
Comfort has three aspects - doors, luggage boot size and persons (carrying capacity).
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['lug_boot'], ['class'])
pd.crosstab([data['doors'], data['lug_boot']], [data['class']])
from eda.composite_plots import heatmap_plot
heatmap_plot(data, ['doors','lug_boot'], ['class'])
data.columns
This is a multiclass classification problem.
The data is imbalanced, as the classes are not equally represented.
Accuracy is not the metric to use when working with an imbalanced dataset; as we have seen, it can be misleading.
Instead, we can use class-aware performance measures such as precision, recall and F1 score, with weighted averaging across classes.
Note : It is common practice to measure the efficacy of binary classification models using the Area Under the Curve (AUC) of the ROC curve. Multiclass classification requires some tweaks to the ROC AUC approach. We are not pursuing that in this notebook. Read this for more info: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
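For reference, scikit-learn supports multiclass ROC AUC through one-vs-rest averaging of predicted probabilities; a sketch only (not pursued below), assuming the encoded train/test arrays built in the preprocessing section later in this notebook:
# One-vs-rest, weighted-average ROC AUC for a probabilistic classifier.
from sklearn.metrics import roc_auc_score
clf = LogisticRegression(max_iter = 1000).fit(X_train_enc, y_train_enc)
print(roc_auc_score(y_test_enc, clf.predict_proba(X_test_enc),
                    multi_class = 'ovr', average = 'weighted'))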
from eda.axes_utils import Add_valuecountsinfo, Add_data_labels
ax = data['class'].value_counts().plot(kind = 'bar', figsize=(3,3))
ax.yaxis.grid(True, alpha = 0.3)
ax.set(axisbelow = True)
Add_data_labels(ax.patches)
Add_valuecountsinfo(ax, 'class', data)
from eda import axes_utils
print(list(dir(axes_utils)))
Techniques to correctly distinguish the minority class can be categorized into four main groups, depending on how they deal with the problem : data-level approaches (resampling), algorithm-level modifications, cost-sensitive learning, and ensemble-based methods. A sketch of the data-level group follows.
Reference : Learning from Imbalanced Data Sets by Alberto Fernández et al.
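As an illustration of the data-level group only (this notebook does not apply resampling), random oversampling equalizes the class counts; a sketch assuming the optional imbalanced-learn package and the encoded training arrays built in the preprocessing section below:
# Random oversampling duplicates minority-class rows until all classes match.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state = 1)
X_res, y_res = ros.fit_resample(X_train_enc, y_train_enc)
print(np.bincount(y_res))  # classes are now equally represented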
X_cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
Y_cols = ['class']
X = data[X_cols]
Y = data[Y_cols]
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)
X_train
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
#Verifying order of columns
#cat = ["buying","maint","doors","persons","lug_boot","safety"]
print(X_train.columns)
print(X_test.columns)
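Given the class imbalance, a stratified split is worth considering; a sketch of an alternative to the plain split above:
# Stratified variant: preserves the class proportions in both splits.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, random_state = 1, stratify = Y)
print(y_train_s['class'].value_counts(normalize = True))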
from sklearn.preprocessing import OrdinalEncoder
#https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py
# prepare input data
def prepare_inputs(X_train, X_test):
oe = OrdinalEncoder(categories = [['low','med','high', 'vhigh' ],
['low','med','high', 'vhigh' ],
['2', '3', '4', '5more'],
['2', '4', 'more'],
['small', 'med', 'big'],
['low', 'med', 'high']]
)
oe.fit(X_train)
X_train_enc = oe.transform(X_train)
X_test_enc = oe.transform(X_test)
return X_train_enc, X_test_enc
from sklearn.preprocessing import LabelEncoder
# prepare target data
def prepare_targets(y_train, y_test):
le = LabelEncoder()
le.fit(np.ravel(y_train))
y_train_enc = le.transform(np.ravel(y_train))
y_test_enc = le.transform(np.ravel(y_test))
return y_train_enc, y_test_enc
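As a sanity check on the target encoding: LabelEncoder assigns integer codes in alphabetical order of the class names; a quick sketch:
# LabelEncoder sorts classes alphabetically: acc=0, good=1, unacc=2, vgood=3.
le = LabelEncoder().fit(np.ravel(Y))
print({c: int(code) for c, code in zip(le.classes_, le.transform(le.classes_))})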
from sklearn.preprocessing import OrdinalEncoder
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
A comparison of the three scenarios below does not justify feature selection methods for this dataset.
Note that we applied a logistic regression classifier in all three scenarios to check for any significant movement in the accuracy scores.
# fit the model using all the features
model = LogisticRegression(solver='lbfgs', max_iter = 1000)
model.fit(X_train_enc, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_enc)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
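To put this score in context, a majority-class baseline is a useful yardstick on imbalanced data; a minimal sketch using scikit-learn's DummyClassifier:
# Predicting the majority class ('unacc') for every instance already
# scores close to the ~70% majority share.
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy = 'most_frequent').fit(X_train_enc, y_train_enc)
print('Baseline accuracy: %.2f' % (dummy.score(X_test_enc, y_test_enc) * 100))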
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
# feature selection
def select_features(X_train, y_train, X_test):
fs = SelectKBest(score_func=mutual_info_classif, k=4)
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
return X_train_fs, X_test_fs, fs
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
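The score indices follow the column order of X_train_enc, which OrdinalEncoder preserves; a small variant labelling the bars with the original column names:
# fs.scores_ is aligned with X_cols because the encoder keeps column order.
plt.bar(X_cols, fs.scores_)
plt.xticks(rotation = 45)
plt.show()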
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
from sklearn.feature_selection import chi2
# feature selection
def select_features2(X_train, y_train, X_test):
fs = SelectKBest(score_func=chi2, k=4)
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
return X_train_fs, X_test_fs, fs
# feature selection
X_train_fs2, X_test_fs2, fs2 = select_features2(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs2.scores_)):
print('Feature %d: %f' % (i, fs2.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs2.scores_))], fs2.scores_)
plt.show()
# fit the model
model = LogisticRegression(solver='lbfgs',max_iter= 1000)
model.fit(X_train_fs2, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_fs2)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Useful link : https://scikit-learn.org/stable/modules/learning_curve.html
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.
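A minimal sketch of the learning_curve API that the plotting helper below wraps (using the encoded arrays from the preprocessing section above):
# Compute train/validation scores at five training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter = 1000), X_train_enc, y_train_enc,
    train_sizes = np.linspace(0.1, 1.0, 5), cv = 5)
print(sizes)
print(train_scores.mean(axis = 1), val_scores.mean(axis = 1))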
# Sample code for calculating time
st_time = time.time()
#code.......
#code.......
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
# Refer to section 8 for the evaluation metrics
def evaluation_parametrics(y_train,yp_train,y_test,yp_test):
print("--------------------------------------------------------------------------")
print("Classification Report for Train Data")
print(classification_report(y_train, yp_train))
print("Classification Report for Test Data")
print(classification_report(y_test, yp_test))
print("--------------------------------------------------------------------------")
# Accuracy
print("Accuracy on Train Data is: {}".format(round(accuracy_score(y_train,yp_train),2)))
print("Accuracy on Test Data is: {}".format(round(accuracy_score(y_test,yp_test),2)))
print("--------------------------------------------------------------------------")
# Precision
print("Precision on Train Data is: {}".format(round(precision_score(y_train,yp_train,average = "weighted"),2)))
print("Precision on Test Data is: {}".format(round(precision_score(y_test,yp_test,average = "weighted"),2)))
print("--------------------------------------------------------------------------")
# Recall
print("Recall on Train Data is: {}".format(round(recall_score(y_train,yp_train,average = "weighted"),2)))
print("Recall on Test Data is: {}".format(round(recall_score(y_test,yp_test,average = "weighted"),2)))
print("--------------------------------------------------------------------------")
# F1 Score
print("F1 Score on Train Data is: {}".format(round(f1_score(y_train,yp_train,average = "weighted"),2)))
print("F1 Score on Test Data is: {}".format(round(f1_score(y_test,yp_test,average = "weighted"),2)))
print("--------------------------------------------------------------------------")
# Creating custom method to populate a dictionary object with evaluation scores of the classifiers
def create_dict(model, modelname, y_train, yp_train, y_test, yp_test):
dict1 = {modelname : {"F1" : {"Train": float(np.round(f1_score(y_train,yp_train,average = "weighted"),2)),
"Test": float(np.round(f1_score(y_test,yp_test,average = "weighted"),2))},
"Recall": {"Train": float(np.round(recall_score(y_train,yp_train,average = "weighted"),2)),
"Test": float(np.round(recall_score(y_test,yp_test,average = "weighted"),2))},
"Precision" :{"Train": float(np.round(precision_score(y_train,yp_train,average = "weighted"),2)),
"Test": float(np.round(precision_score(y_test,yp_test,average = "weighted"),2))
}}
}
return dict1
scores_dict = {}  # renamed from `dict` to avoid shadowing the Python builtin
def plot_learning_curve(
estimator,
title,
X,
y,
axes=None,
ylim=None,
cv=None,
n_jobs=None,
train_sizes=np.linspace(0.1, 1.0, 5),
):
"""
Generate 3 plots: the test and training learning curve, the training
samples vs fit times curve, the fit times vs score curve.
Parameters
----------
estimator : estimator instance
An estimator instance implementing `fit` and `predict` methods which
will be cloned for each validation.
title : str
Title for the chart.
X : array-like of shape (n_samples, n_features)
Training vector, where ``n_samples`` is the number of samples and
``n_features`` is the number of features.
y : array-like of shape (n_samples) or (n_samples, n_features)
Target relative to ``X`` for classification or regression;
None for unsupervised learning.
axes : array-like of shape (3,), default=None
Axes to use for plotting the curves.
ylim : tuple of shape (2,), default=None
Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).
cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- integer, to specify the number of folds.
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : int or None, default=None
Number of jobs to run in parallel.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
train_sizes : array-like of shape (n_ticks,)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the ``dtype`` is float, it is regarded
as a fraction of the maximum size of the training set (that is
determined by the selected validation method), i.e. it has to be within
(0, 1]. Otherwise it is interpreted as absolute sizes of the training
sets. Note that for classification the number of samples usually have
to be big enough to contain at least one sample from each class.
(default: np.linspace(0.1, 1.0, 5))
"""
if axes is None:
_, axes = plt.subplots(1, 3, figsize=(20, 5))
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
estimator,
X,
y,
cv=cv,
n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True,
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
# Plot learning curve
axes[0].grid()
axes[0].fill_between(
train_sizes,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.1,
color="r",
)
axes[0].fill_between(
train_sizes,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.1,
color="g",
)
axes[0].plot(
train_sizes, train_scores_mean, "o-", color="r", label="Training score"
)
axes[0].plot(
train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
)
axes[0].legend(loc="best")
# Plot n_samples vs fit_times
axes[1].grid()
axes[1].plot(train_sizes, fit_times_mean, "o-")
axes[1].fill_between(
train_sizes,
fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std,
alpha=0.1,
)
axes[1].set_xlabel("Training examples")
axes[1].set_ylabel("fit_times")
axes[1].set_title("Scalability of the model")
# Plot fit_time vs score
fit_time_argsort = fit_times_mean.argsort()
fit_time_sorted = fit_times_mean[fit_time_argsort]
test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
test_scores_std_sorted = test_scores_std[fit_time_argsort]
axes[2].grid()
axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
axes[2].fill_between(
fit_time_sorted,
test_scores_mean_sorted - test_scores_std_sorted,
test_scores_mean_sorted + test_scores_std_sorted,
alpha=0.1,
)
axes[2].set_xlabel("fit_times")
axes[2].set_ylabel("Score")
axes[2].set_title("Performance of the model")
return plt
lr = LogisticRegression(max_iter = 1000,random_state = 48, multi_class = 'multinomial')
st_time = time.time()
lr.fit(X_train_enc,y_train_enc)
yp_train_enc = lr.predict(X_train_enc)
yp_test_enc = lr.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(lr,X_test_enc, y_test_enc)
dict1 = create_dict(lr, "Logistic Regression Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(lr, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Logistic Regression Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
dt = DecisionTreeClassifier(max_depth = 7,random_state = 48) # Keeping max_depth = 7 to avoid overfitting
dt.fit(X_train_enc,y_train_enc)
yp_train_enc = dt.predict(X_train_enc)
yp_test_enc = dt.predict(X_test_enc)
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(dt,X_test_enc, y_test_enc)
dict1 = create_dict(dt, "Decision Tree Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(dt, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Decision Tree Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7)
st_time = time.time()
knn.fit(X_train_enc,y_train_enc)
yp_train_enc = knn.predict(X_train_enc)
yp_test_enc = knn.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(knn,X_test_enc, y_test_enc)
dict1 = create_dict(knn, "K Nearest Neighbor Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(knn, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Knn Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
st_time = time.time()
gnb.fit(X_train_enc,y_train_enc)
yp_train_enc = gnb.predict(X_train_enc)
yp_test_enc = gnb.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(gnb,X_test_enc, y_test_enc)
dict1 = create_dict(gnb, "Naive Bayes Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(gnb, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Naive Bayes Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
rf = RandomForestClassifier(max_depth = 7,random_state = 48) # Keeping max_depth = 7 same as DT
st_time = time.time()
rf.fit(X_train_enc,y_train_enc)
yp_train_enc = rf.predict(X_train_enc)
yp_test_enc = rf.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(rf,X_test_enc, y_test_enc)
dict1 = create_dict(rf, "Random Forest Classifier", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(rf, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Random Forest Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)
st_time = time.time()
svm.fit(X_train_enc,y_train_enc)
yp_train_enc = svm.predict(X_train_enc)
yp_test_enc = svm.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(svm,X_test_enc, y_test_enc)
dict1 = create_dict(svm, "Linear SVC", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(svm, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Linear SVC Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=10)
st_time = time.time()
gb_model.fit(X_train_enc,y_train_enc)
yp_train_enc = gb_model.predict(X_train_enc)
yp_test_enc = gb_model.predict(X_test_enc)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
evaluation_parametrics(y_train_enc,yp_train_enc,y_test_enc,yp_test_enc)
plot_confusion_matrix(gb_model, X_test_enc, y_test_enc)  # fixed: was mistakenly plotting the SVM's matrix
dict1 = create_dict(gb_model, "Gradient Boosting", y_train_enc, yp_train_enc, y_test_enc, yp_test_enc)
scores_dict.update(dict1)
plot_learning_curve(gb_model, X = X_train_enc, y = y_train_enc,
title = "Learning Curves (Gradient Boosting Classifier)",
train_sizes=np.linspace(0.1, 1.0, 5))
# Retrieving the performance scores from the scores_dict object
pd.DataFrame.from_dict({(i,j): scores_dict[i][j]
                        for i in scores_dict.keys()
                        for j in scores_dict[i].keys()},
                       orient='index')
# Retrieving the performance scores from the scores_dict object
# Transposing rows and headers relative to the previous table
# Tabulating the scores for the different classifiers
model_names = []
frames = []
for model_name, d in scores_dict.items():
    model_names.append(model_name)
    frames.append(pd.DataFrame.from_dict(d, orient='columns'))
df = pd.concat(frames, keys=model_names)
df.unstack(level = -1).style.background_gradient(cmap='Blues')
Reference : Learning from imbalanced data: open challenges and future directions, by Bartosz Krawczyk. Example application areas with inherent class imbalance, from the same paper:
Application Area | Problem Description |
---|---|
Activity Recognition | Detection of rare or less-frequent activities (multi-class problem) |
Behavior Analysis | Recognition of dangerous behavior (binary problem) |
Cancer Malignancy grading | Analyzing the cancer severity (binary and multi-class problem) |
Hyperspectral data analysis | Classification of varying areas in multi-dimensional images (multi-class problem) |
Industrial Systems monitoring | Fault detection in industrial machinery (binary problem) |
Sentiment analysis | Emotion and temper recognition in text (binary and multi-class problem) |
Software defect prediction | Recognition of errors in code blocks (binary problem) |
Target detection | Classification of specified targets appearing with varied frequency (multi-class problem) |
Text mining | Detecting relations in literature (binary problem) |
Video mining | Recognizing objects and actions in video sequences (binary and multi-class problem) |