K-Nearest Neighbors, Boxplot, Standard Score, Feature Scaling, Curse of Dimensionality, Missing Values, Confusion Matrix, Classification Report, ROC-Curve, AUROC

Today's data is one of the most widely used data sets in Machine Learning examples about classification. It's about beautiful Iris flowers and how they can be classified into different types of Iris flowers. This is the URL where the data and the attribute information can be found: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ Nevertheless, you can also simply continue with this tutorial because all necessary information is also provided here.

Attribute information

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica

Read Csv

In [2]:
import pandas as pd

#attribute names 
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

#get data with attribute names
iris_data = pd.read_csv('iris_data.csv', names = names)

#shuffle data
iris_data = iris_data.sample(frac=1)

#print 5 first lines of the data
iris_data.head()
Out[2]:
sepal_length sepal_width petal_length petal_width class
138 6.0 3.0 4.8 1.8 Iris-virginica
2 4.7 3.2 1.3 0.2 Iris-setosa
94 5.6 2.7 4.2 1.3 Iris-versicolor
107 7.3 2.9 6.3 1.8 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica

Divide into features and target

In [3]:
#extract feature variables
x_variables = iris_data.loc[:, iris_data.columns != 'class']

#extract target variable
y_variable = iris_data['class']

Split into training and test data

In [4]:
from sklearn.model_selection import train_test_split 

#get training and test data
x_train, x_test, y_train, y_test = train_test_split(x_variables, y_variable, test_size=0.20)  

Standard score (z-score) and Feature Scaling

Great! We already have our data, and we split it into feature and target variables as well as into training and test data. Before we learn how the K-Nearest Neighbors algorithm works, we will have a look at normalization, what it is and why it is done. Imagine you have features that vary in magnitudes, units and range. Classification calculations could become difficult. For example, if you want to classify a person as overweight, normal weight or underweight using weight and height: centimetres and kilograms are different units, and the number of kilograms a person weighs is usually less than half of the number of centimetres of that person's height. If we want to make the numbers of two features more comparable, we somehow have to scale them to the same level. The most popular ways to do this are Standardization and Feature Scaling. The formula for the Standard Score looks like this:
$z={x-\mu \over \sigma }$ where $\mu$ is the population mean and $\sigma$ is the population standard deviation. After applying this formula the mean becomes 0 and the standard deviation 1. The cool thing about the z-score is that it makes the data much more comparable. For example, if you know that someone's weight z-score is 3, then you know that it is 3 standard deviations above the mean. This is quite a lot! If the distribution is roughly Gaussian, a weight 3 standard deviations above the mean is higher than the weight of around 99.9% of the overall population. Using the z-score has the advantage that outliers affect the normalization less than they do with Feature Scaling, where a single extreme value determines the minimum or maximum and squeezes all other values together. On the other hand, the z-score does not bound the values to a fixed interval, so a feature with a skewed distribution or heavy outliers can still span a wider range than the others and thus carry more weight in distance calculations. Thus, especially when applying an algorithm that computes distances, Feature Scaling is often the better choice. The formula for Feature Scaling looks like this:
$X'={\frac {X-X_{\min }}{X_{\max }-X_{\min }}}$ After applying this formula all feature values will be in the range [0, 1] (a slight variation of the formula maps them to [−1, 1] instead). Since the K-Nearest Neighbors algorithm computes distances, we will use Feature Scaling. Let's find out what a Python implementation of feature scaling using the Sklearn library looks like.

In [5]:
from sklearn.preprocessing import MinMaxScaler

#create MinMaxScaler object
scaler_min_max = MinMaxScaler()

#fit object to data
scaler_min_max.fit(x_train)

#get transformed train data
x_train_normalized = scaler_min_max.transform(x_train)

#get transformed test data
x_test_normalized = scaler_min_max.transform(x_test)
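
For comparison, here is a minimal sketch of what z-score standardization (the Standard Score from above) would look like with Sklearn's StandardScaler; we stick with the MinMaxScaler output for the rest of this tutorial:

from sklearn.preprocessing import StandardScaler

#create StandardScaler object (z-score: subtract the mean, divide by the standard deviation)
scaler_standard = StandardScaler()

#fit on the training data only, then transform both sets with the same parameters
x_train_standardized = scaler_standard.fit_transform(x_train)
x_test_standardized = scaler_standard.transform(x_test)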

K-Nearest Neighbors

Now that we have normalized data we can start building our model. Well, actually using the term "building a model" is not really accurate when applying the K-Nearest Neighbors algorithm. The information about the training data is simply stored somewhere - e.g. in a database. Whenever we want to predict the class membership of a new instance we compare the instance with the stored instances. The most similar instances and their class memberships then decide about the new instance's class membership.

How is this similarity measured? Do you remember the Pythagorean theorem? Exactly: $a^2+b^2 = c^2$! This works perfectly for two dimensions. Applying this to a coordinate system in which we want to measure the distance between two points representing two instances A(2,3) and B(1,2), the distance would be $\sqrt{(2-1)^2 + (3-2)^2}$. By adding one more dimension, which is equal to adding a third feature, our points (instances) in the coordinate system could for example look like this: A(2,3,5) and B(1,2,7). The new distance would then be $\sqrt{(2-1)^2 + (3-2)^2 +(5-7)^2}$. The idea is the same as with the Pythagorean theorem; only the name of the formula changes: it is called the Euclidean Distance. This also tells us why it was so important to normalize the data before applying the algorithm: if one feature has values between 1000 and 2000 and another feature values between 0 and 5, then the feature ranging from 0 to 5 would hardly matter anymore when calculating the distances to the new instance. Therefore, in order to maintain the importance of each feature, all features are normalized to values between 0 and 1.

Now that we know about all of this we can finally build our KNeighborsClassifier object using the Sklearn library. As a parameter we can decide how many nearest neighbors (and their class memberships) will be taken into account when deciding on the new instance's class membership. How many neighbors should be taken into account depends largely on the data - there is no universal rule - it is all about trying. By specifying a "weights" parameter it is also possible to give nearest neighbors that are closer to the new instance more weight in determining the class membership.
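
Before fitting the classifier, here is a quick check of the distance calculation from the example above, a minimal sketch with plain NumPy:

import numpy as np

#the two example instances with three features each
A = np.array([2, 3, 5])
B = np.array([1, 2, 7])

#Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((A - B) ** 2))
print(distance)   #sqrt(6) = 2.449...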

In [6]:
from sklearn.neighbors import KNeighborsClassifier  

#create KNeighborsClassifier object
classifier_normalization = KNeighborsClassifier(n_neighbors=10)  

#fit object to data
classifier_normalization.fit(x_train_normalized, y_train)
Out[6]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform')
In [7]:
#get predictions
y_pred_normalization = classifier_normalization.predict(x_test_normalized)  
In [8]:
from sklearn.metrics import classification_report, confusion_matrix  

#confusion matrix
print(confusion_matrix(y_test, y_pred_normalization))  

#classification report
print(classification_report(y_test, y_pred_normalization)) 
[[ 9  0  0]
 [ 0  8  0]
 [ 0  1 12]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         9
Iris-versicolor       0.89      1.00      0.94         8
 Iris-virginica       1.00      0.92      0.96        13

      micro avg       0.97      0.97      0.97        30
      macro avg       0.96      0.97      0.97        30
   weighted avg       0.97      0.97      0.97        30

Ok, so far the only thing we know is that the k-nearest neighbors algorithm apparently works really well on our data, but some visualization to understand better why this is the case might also be interesting. So let's do it. Before we start, it should be mentioned that this part of the tutorial got most of its inspiration from this website: https://www.kaggle.com/skalskip/iris-data-visualization-and-knn-classification

We start with simple boxplots that show how the range of each feature varies with class membership.

In [9]:
import matplotlib.pyplot as plt

#make 4 different boxplots grouped by class membership
iris_data.boxplot(by="class", figsize=(15,10))
plt.show()

This looked rather boring so let's do something a little bit fancier. The description of what you can see is cited from this page: (https://www.kaggle.com/skalskip/iris-data-visualization-and-knn-classification, 31.01.2019). "Parallel coordinates is a plotting technique for plotting multivariate data. It allows one to see clusters in data and to estimate other statistics visually. Using parallel coordinates points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together."

In [10]:
from pandas.plotting import parallel_coordinates

#figure size
plt.figure(figsize=(15,10))

#define features and class
parallel_coordinates(iris_data, "class")

#Plot title
plt.title('Parallel Coordinates Plot', fontsize=20, fontweight='bold')

#name x-axis
plt.xlabel('Features', fontsize=15)

#name y-axis
plt.ylabel('Features values', fontsize=15)

#legend attributes definition
plt.legend(loc=1, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
plt.show()

Ok, this already looked much fancier! But we are still caught in a 2D world! Let's add a third dimension! Well, actually we have 4 features. Thus, a fourth dimension would be even better. Unfortunately no human being - at least as far as I know - is capable of imagining something like this. Thus, a little trick is used in the 3D plot: the values of the fourth feature are represented by the size of the data points.

Most of the time it is much easier for machine learning models to handle numeric data rather than string data. In this case our 3D plot cannot handle string data when assigning different colors to the data points. That's why we use the LabelEncoder object to transform the class membership names into numbers from 0 to 2.

Curse of Dimensionality

Looking at the 3D plot we can see how small the number of instances appears to be in the big space which the 3D plot provides. Imagine how sparse the data would look if we added even more dimensions. Well, yes, you are right, you cannot imagine how that might look because we are unable to imagine a space with more than three dimensions. Nevertheless, I hope this explains the point: the more dimensions you create by adding more features, the more instances you need to fill the space and therefore to be able to make valuable class distinctions. Therefore, when applying the k-nearest neighbors algorithm, it can be very useful to preselect relevant features wisely instead of using all features you have. Remember the earlier example about classifying people as underweight, normal weight or overweight: height and weight definitely seem like important features. However, if we also had information about each person's favourite animal, this would probably not be relevant for our classification and the feature could be left out for better results.
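
Before we draw the 3D plot, here is a tiny numerical illustration of this effect (a sketch with random data, not the Iris measurements): with a fixed number of points, the average distance to the nearest neighbour grows quickly as we add dimensions, i.e. the space becomes emptier.

import numpy as np

rng = np.random.RandomState(0)
n_points = 150  #roughly the size of the Iris data set

for d in [1, 2, 3, 10, 50]:
    #n_points random points in the d-dimensional unit cube
    X = rng.rand(n_points, d)
    #pairwise Euclidean distances; ignore the distance of a point to itself
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    print(d, "dimensions -> mean nearest-neighbour distance:",
          round(dists.min(axis=1).mean(), 3))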

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y_variable)

#find out which order classes have
print(le.classes_)

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1, figsize=(20, 15))
ax = Axes3D(fig, elev=48, azim=134)
ax.scatter(x_variables.iloc[:,0], x_variables.iloc[:,1], x_variables.iloc[:,2], c = y,
           cmap=plt.cm.Set1, edgecolor='k', s = x_variables.iloc[:, 3]*150)


#get position for class names in plot
for name, label in [('Iris-setosa', 0), ('Iris-versicolor', 1), ('Iris-virginica', 2)]:
    ax.text3D(x_variables.iloc[y == label, 0].mean(),
              x_variables.iloc[y == label, 1].mean(),
              x_variables.iloc[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'),size=25)

ax.set_title("3D visualization", fontsize=40)
ax.set_xlabel("Sepal Length [cm]", fontsize=25)
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("Sepal Width [cm]", fontsize=25)
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("Petal Length [cm]", fontsize=25)
ax.w_zaxis.set_ticklabels([])

plt.show()
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Support Vector Machine

So far we have learned how to classify new instances just by looking at their nearest neighbors and their class memberships. The next approach applies a slightly different technique: it tries to find a line, a plane or a hyperplane that is able to divide the data into two different classes. Theoretically there are techniques that allow classification into more classes when using the Support Vector Machine algorithm. Nevertheless, it was originally designed for the division of data into two classes. Therefore, since our previous example has three classes, we will choose a new dataset with only two classes.

In [12]:
from sklearn import datasets

#Load dataset
cancer_data = datasets.load_breast_cancer()

As always, let's get a quick overview of our data. Therefore, we create a Pandas DataFrame which shows us the feature values and the corresponding target value - in our case whether the patient has cancer or not.

In [13]:
import pandas as pd
import numpy as np

#create dataframe
cancer_df = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)

#add target column
cancer_df['has_cancer'] = cancer_data.target

#shuffle data
cancer_df = cancer_df.sample(frac=1)

#print 5 first lines of data
cancer_df.head()
Out[13]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension has_cancer
522 11.26 19.83 71.30 388.1 0.08511 0.04413 0.005067 0.005664 0.1637 0.06343 ... 26.43 76.38 435.9 0.1108 0.07723 0.02533 0.02832 0.2557 0.07613 1
17 16.13 20.68 108.10 798.8 0.11700 0.20220 0.172200 0.102800 0.2164 0.07356 ... 31.48 136.80 1315.0 0.1789 0.42330 0.47840 0.20730 0.3706 0.11420 0
8 13.00 21.82 87.50 519.8 0.12730 0.19320 0.185900 0.093530 0.2350 0.07389 ... 30.73 106.20 739.3 0.1703 0.54010 0.53900 0.20600 0.4378 0.10720 0
102 12.18 20.52 77.22 458.7 0.08013 0.04038 0.023830 0.017700 0.1739 0.05677 ... 32.84 84.58 547.8 0.1123 0.08862 0.11450 0.07431 0.2694 0.06878 1
20 13.08 15.71 85.63 520.0 0.10750 0.12700 0.045680 0.031100 0.1967 0.06811 ... 20.49 96.09 630.5 0.1312 0.27760 0.18900 0.07283 0.3184 0.08183 1

5 rows × 31 columns

Let's have a look at the percentage of people having/not having cancer.

In [14]:
#amount of instances having cancer
has_cancer = len(cancer_df.loc[cancer_df['has_cancer'] ==1])

#total amount of instances
total_amount = len(cancer_df['has_cancer'])

#amount of instances not having cancer
no_cancer = len(cancer_df.loc[cancer_df['has_cancer'] ==0])

print('percentage of people having cancer: ' + str(round(has_cancer/total_amount,2)))
print('percentage of people not having cancer: ' + str(round(no_cancer/total_amount,2)))
percentage of people having cancer: 0.63
percentage of people not having cancer: 0.37

Missing values

Let's do something which we have not yet done in any of the other tutorials but which we actually should have done each time before applying any algorithm to our data: we search for Missing Values. This is a very essential step in data pre-processing. Usually it is very likely that some rows have missing values in one or more columns. For example, imagine you have a questionnaire in which you let people decide whether they want to specify their age or gender. A lot of people will decide not to provide this information. Thus, you will have a lot of missing values in the age and gender columns. Instead, you will often find the abbreviation 'n/a', which stands for "not applicable".

If you want to find out whether your data has missing values, you can use the functions isnull() or isna(), which do exactly the same thing. You will get a good overview if you add the sum() function: it will give you the number of missing values per column. In our case - since we use data from Sklearn - most of the data pre-processing has already been done for us.

Nevertheless, for future cases, let's think about what we could do if we encounter missing values. The easiest option would simply be to eliminate those rows from our data. However, regarding the example just mentioned, imagine that more than half of the people had not specified their age, their gender or both. This would mean that after the elimination of those instances we would have less than half of the data left compared to the original dataset. Especially for predictions in a higher-dimensional space - as we learned in the section on the curse of dimensionality - this could lead to a huge lack of the data needed to fill up the space. Therefore, another option is to take the mean / the most frequent value / the median of the column containing missing values and replace the missing values with it. In the questionnaire example this could be done by taking the arithmetic mean of the specified ages and replacing the missing values in the age column with this mean. Nevertheless, if an instance has a lot of missing values, it might be better to eliminate it from the dataframe to avoid creating an "artificial" instance.

In [15]:
#two ways to look for missing values per column 
cancer_df.isnull().sum()
cancer_df.isna().sum()
Out[15]:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
has_cancer                 0
dtype: int64
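
If we had found missing values, a minimal sketch of the mean-imputation strategy described above could look like this (a toy DataFrame with made-up values, not our cancer data):

import numpy as np
import pandas as pd

#toy data with a gap in the age column (hypothetical example)
survey_df = pd.DataFrame({'age': [23, 31, np.nan, 45],
                          'height': [170, 182, 165, 178]})

#replace each missing value with the mean of its column
survey_df_filled = survey_df.fillna(survey_df.mean())

print(survey_df_filled)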

Now that we know that we do not have to worry about missing values regarding our data, we can divide our data into test and training data.

In [16]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size=0.3,random_state=42)

The Support Vector Machine also uses distances to classify data. So, what do we need to do? Right! We have to normalize the data.

In [17]:
from sklearn.preprocessing import MinMaxScaler

#create MinMaxScaler object
scaler_min_max = MinMaxScaler()

#fit object to data
scaler_min_max.fit(X_train)

#get transformed train data
X_train_normalized = scaler_min_max.transform(X_train)

#get transformed test data
X_test_normalized = scaler_min_max.transform(X_test)

Great! Now we are finally prepared to use the Support Vector Machine algorithm provided by Sklearn. Wait, we still have no idea what the algorithm actually does - so let's get a better understanding! Well, actually the math behind SVMs is really not that easy and unfortunately far exceeds the capacity of this tutorial. Therefore, in this tutorial, we will only learn about the basic idea of SVMs and how to implement them in Python. Nevertheless, don't be shy to learn more about SVMs yourself! This is the link to an amazing tutorial about the math behind SVMs: https://www.svm-tutorial.com/ The same person also wrote a book about SVMs which can be found here: https://www.svm-tutorial.com/2017/10/support-vector-machines-succinctly-released/

As mentioned above, the algorithm tries to find a point/a line/a plane/a hyperplane to divide the data points according to their class membership. This is possible if the data is linearly separable. A point is able to separate the data when there is one feature (one dimension), a line when there are two features (two dimensions), a plane when there are three features (three dimensions) and a hyperplane when there are more than three features (n-dimensional space). The algorithm does this by finding a hypothesis function that gives each instance above the hyperplane the label +1 and every instance beneath the hyperplane the label -1. Once a hyperplane has been found, this division can easily be done by plugging the feature values of a new instance into the hyperplane equation: if the result is above 0 it gets the label +1, if the result is below 0 it gets the label -1. A special case would be a result of exactly 0: this would mean that the instance lies on the hyperplane. Then both classifications are possible, with a 50% chance of making the right classification in case the number of instances is equally distributed on both sides of the plane.

Now we know that after having found a hyperplane separating our data, it becomes very easy to classify new instances. However, finding this hyperplane unfortunately is not that easy, so let's look at the basic idea of finding it. If our data is linearly separable, there is an infinite number of possible hyperplanes that can separate it. Finding one of them is not so difficult: it is all about finding a set of weights for the hyperplane equation that leaves no instance of the training set misclassified. Those weights are initialized randomly and then adjusted with each iteration of a loop until every instance is correctly classified. This is called the Perceptron Learning Algorithm (PLA). The problem is that this hyperplane might not be the perfect one: if the hyperplane found is very close to the data points of, let's say, the instances that were classified as -1, then the probability of misclassifying a new instance as +1 increases. Therefore, it is not enough to simply find a hyperplane: we want to find a hyperplane with a maximum distance to the data points on both sides. That distance is called the margin. Thus, the idea of Support Vector Machines is to find a hyperplane which separates the data and has the largest margin.
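
To make the decision rule concrete, here is a minimal sketch (with made-up weights w and bias b, purely for illustration) of how a linear hyperplane assigns the labels +1 and -1:

import numpy as np

#hypothetical hyperplane parameters: w are the weights, b is the bias/offset
w = np.array([0.4, -0.7])
b = 0.1

#two made-up instances with two features each
new_instances = np.array([[1.5, 0.3],
                          [0.2, 2.0]])

#the sign of w.x + b decides on which side of the hyperplane an instance lies
labels = np.sign(new_instances @ w + b)
print(labels)   #[ 1. -1.]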

This was already a lot of information! However, this was just the beginning: Support Vector Machines combine a lot of different ideas, concepts and solutions in just one algorithm. So far we have always supposed that our data is linearly separable - but what happens if this is not the case? Well, as you might have already guessed, SVMs also provide a solution for this case: the so-called kernel. Even if the data is not linearly separable in its original dimension, in another (higher) dimension it probably is. In order to find the dimension in which the data becomes linearly separable, all feature vectors would have to be mapped into that higher dimension - which would cause high computational costs. However, maybe you have already heard about the kernel trick! The algorithm only ever needs dot products between feature vectors, and a kernel function computes, directly from the original vectors, the value that such a dot product would have in the higher-dimensional space - without ever performing the mapping. Sounds fancy? Well, it is, and there are a lot of different types of kernels that can be used for this. In our example we will try several kernels and see which one works best for our data. In real life, where you might not have the time to test several kernels, it is often recommended to start with the Gaussian kernel, also called the RBF kernel, which corresponds to a mapping into an ∞-dimensional space!
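
As an illustration (a sketch assuming a Gaussian/RBF kernel with an arbitrarily chosen gamma), the kernel value can be computed directly from the original vectors and agrees with Sklearn's implementation:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

#two made-up feature vectors in the original space
a = np.array([1.0, 2.0, 0.5])
b = np.array([0.0, 1.5, 1.0])
gamma = 0.5

#RBF kernel computed by hand: exp(-gamma * squared Euclidean distance)
k_manual = np.exp(-gamma * np.sum((a - b) ** 2))

#the same value via Sklearn
k_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)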

Another, literally more down-to-earth approach - after the high-dimensional space our kernel has taken us to - is to simply accept that not every data point has to be on the right side of the hyperplane. We already learned what overfitting and outliers are. Therefore, if there is a data point that has all the characteristics of one class but carries the label of the opposite class, it wouldn't make sense to go up in dimensions until this data point is classified with the correct label. Thus, there is another possible way to handle this: specifying a C value as a parameter determines whether the algorithm should find a wider margin at the cost of some misclassifications, also called a soft margin (small C), or a hard margin which does not tolerate any misclassifications (huge C). Unfortunately, as in many cases, there is no magic value for C that will make the perfect fit for every data set. As is so often the case, the only thing we can do is to try several values and find out which one works best for our data. So let's get started:

As the grid search below will show, the best parameter for our SVM model is the linear kernel, which means our data is already linearly separable in its original dimension. Hurray! Therefore, we can also forget about the gamma parameter because it is only relevant for the RBF, poly and sigmoid kernels.

In [49]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# instantiate SVC object. Probability parameter needed for ROC-Curve in the next step
s = SVC(probability = True)

# define parameters for the GridSearchCV: try different kernels, different C values and, for the 'poly', 'sigmoid' and 'rbf' kernels, different gamma values
parameters = {'kernel':('linear', 'poly','sigmoid','rbf'), 'C':[0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0],'gamma': (1,2,3,'auto')}

# instantiate GridSearchCV object with its parameters 
clf = GridSearchCV(s, parameters, iid=False, cv=5)

# train models
clf.fit(X_train_normalized, y_train)

# check average scores of models
print("accuracy:"+str(np.average(cross_val_score(clf, X_train_normalized, y_train, scoring='accuracy', cv=5))))
print("f1:"+str(np.average(cross_val_score(clf, X_train_normalized, y_train, scoring='f1', cv=5))))

#check which model performed best 
clf.best_params_
accuracy:0.9723717948717947
f1:0.9782539652023455
Out[49]:
{'C': 1.5, 'gamma': 1, 'kernel': 'linear'}
In [19]:
# instantiate a new object with the best parameters for our data
# (note: GridSearchCV already refit the best estimator on the training set, so we can also predict with clf directly)
best_params_svm = SVC(kernel='linear', probability = True, C = 1.5)

# get predictions from the refitted grid-search estimator
y_pred = clf.predict(X_test_normalized)

Confusion Matrix, Classification Report

In previous tutorials we often used the Confusion Matrix and the Classification Report, but we never really had a closer look at what words like precision, recall and f1-score actually mean. So, let's do it now! Before we start with the Classification Report, let's have a look at the Confusion Matrix first, since all of the mentioned scores can be calculated from the matrix values. The top left value ([0,0]) is the number of True Negatives (TN): the instances that were diagnosed as healthy and are healthy. The number on the top right ([0,1]) is the number of False Positives (FP): the people who were diagnosed with cancer without actually having cancer.
The bottom left value ([1,0]) is the number of False Negatives (FN), which represents the people who have cancer but were falsely diagnosed as healthy. And last but not least, the value on the bottom right ([1,1]) is the number of True Positives (TP): the instances that were diagnosed with cancer and actually have cancer. A small mnemonic device: the correctly classified instances are on the diagonal from the top left to the bottom right, and the misclassified ones are on the diagonal from the top right to the bottom left. Sometimes TP/TN are interchanged, as are FP/FN. This depends on the meaning of 1 and 0 in the target column, but the mnemonic for the diagonal is always the same.

Great! Now we know the basics and are able to find out more about the Classification Report. So let's go through the scores step by step!

Precision: The formula for precision is $\frac{TP}{(TP+FP)}$, which means it tells you how many of your positive predictions were correct: the number of instances correctly classified as having cancer is compared to the total number of people classified as having cancer, including those who were falsely classified as malignant. For the negative class the analogous calculation is $\frac{TN}{(TN+FN)}$.

Recall: The formula for recall is $\frac{TP}{(TP+FN)}$, which means that the percentage shows you how many of the positive cases the model was able to catch. In our case this value is very important because it tells you whether there were instances falsely classified as healthy, which in the case of cancer could lead to death. Therefore, in our example, this score might even be the most important score. We will not apply this in this tutorial, but by passing a scoring parameter to GridSearchCV it is even possible to define which score the model selection should be optimized for. Independently of that, the trade-off between precision and recall can also be shifted by using a different decision threshold: for example, instead of labelling an instance as malignant only if the probability of that instance having cancer is at least 50%, the model could already label an instance as malignant if that probability is only 30%.
For the negative class the analogous calculation is $\frac{TN}{(TN+FP)}$.
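
A minimal sketch of such a threshold adjustment with our already fitted classifier (the 0.3 threshold is just an example value, not a recommendation):

import numpy as np

#probability of belonging to class 1 for each test instance
proba_positive = clf.predict_proba(X_test_normalized)[:, 1]

#label an instance as 1 already when that probability reaches 30%
y_pred_low_threshold = (proba_positive >= 0.3).astype(int)

#compare the number of positive labels from the default prediction and from the lowered threshold
print('positives with default predict:', np.sum(clf.predict(X_test_normalized) == 1))
print('positives with 0.3 threshold:  ', np.sum(y_pred_low_threshold))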

F1-Score: The formula for the F1-Score is $\frac{2*Recall*Precision}{(Recall + Precision)}$. This is the harmonic mean of precision and recall. Often, the F1-Score is a better indicator than accuracy of how good the model's predictions are. Why? Well, do you remember when we talked about base rate and accuracy? When there is an unequal class distribution, accuracy is not the best score to evaluate the performance of your model. Imagine the probability of an instance not having cancer was 99%: then a model which always predicts that a person does not have cancer would have an accuracy of 99%. Therefore, the F1-score, which focuses on the positive class and is not inflated by simply predicting the majority class, is often the better indicator of the model performance.
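
A tiny sketch of that argument with made-up labels (1% positives, a "model" that always predicts the negative class):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

#made-up ground truth: 99 healthy instances and 1 instance with cancer
y_true_toy = np.array([0] * 99 + [1])

#a model that always predicts "no cancer"
y_pred_toy = np.zeros(100, dtype=int)

print('accuracy:', accuracy_score(y_true_toy, y_pred_toy))   #0.99
print('f1-score:', f1_score(y_true_toy, y_pred_toy))         #0.0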

In [20]:
from sklearn import metrics

#Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

#confusion matrix
cancer_conf_matrix = metrics.confusion_matrix(y_test, y_pred)
print('Confusion Matrix: ')
print(cancer_conf_matrix)

#Classification Report
print('Classification Report')
print(metrics.classification_report(y_test, y_pred))
Accuracy: 0.9883040935672515
Confusion Matrix: 
[[ 61   2]
 [  0 108]]
Classification Report
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        63
           1       0.98      1.00      0.99       108

   micro avg       0.99      0.99      0.99       171
   macro avg       0.99      0.98      0.99       171
weighted avg       0.99      0.99      0.99       171
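
As a quick sanity check, the scores reported for class 1 can be recomputed by hand from the confusion matrix above:

#unpack the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cancer_conf_matrix.ravel()

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * recall_1 * precision_1 / (recall_1 + precision_1)

print('precision:', round(precision_1, 2))   #0.98
print('recall:', round(recall_1, 2))         #1.0
print('f1-score:', round(f1_1, 2))           #0.99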

ROC-Curve, AUROC

A good way of visualizing the confusion matrix as a graph is the ROC curve. It plots the True Positive Rate against the False Positive Rate. The ideal curve would hug the upper left corner (a True Positive Rate of 1 at a False Positive Rate of 0). We can see that our plot almost does this, which is really good. If our ROC curve lay on the red diagonal, it would mean that our model was only as good as random guessing. And if it lay below the red line, our model would actually make worse predictions than if we randomly guessed whether someone has cancer or not.

In [43]:
import matplotlib.pyplot as plt

# get probabilities of class membership of test instances
probs = clf.predict_proba(X_test_normalized)

#get col with probabilities
y_pred_proba = probs[:,1]

# get false positive rate, true positive rate and threshold values
fpr, tpr, threshold = metrics.roc_curve(y_test, y_pred_proba)

# Compute Area Under the Curve (AUC) using the trapezoidal rule
roc_auc = metrics.auc(fpr, tpr)
print(roc_auc)

#define figure size 
plt.figure(figsize=(7,7))

#add title
plt.title('ROC Curve')

# plot and add labels to plot
plt.plot(fpr, tpr, 'b', label = 'Normalized data: AUC =  ' + str(round(roc_auc,4)))
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

0.9979423868312757

The AUROC is the area under the ROC curve. The closer this value is to 1, the better. It is more informative than accuracy alone: it can be interpreted as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one, i.e. how well the model distinguishes the two classes.

In [40]:
print(metrics.roc_auc_score(y_test, y_pred_proba))
0.9979423868312757

That's it for today! I hope you liked the tutorial! Soon we will cover many other exciting topics!