K-Folding, Parametrization of Decision Trees

In the main tutorials we use a lot of libraries that help us implement the introduced concepts quickly and efficiently. Nevertheless, it is often unclear what such algorithms actually look like "behind the scenes". Therefore, in this companion to the main tutorial, we will provide some examples of algorithms that work similarly to, or in the same way as, the algorithms provided by the libraries we use.

In order to look at the algorithms, we first need the data (we use the same data as in the main tutorial).

In [9]:
#Import Pandas library
import pandas as pd

#List with attribute names (it is optional to do this but it gives a better understanding of the data for a human reader)
attribute_names = ['variance_wavelet_transformed_image', 'skewness_wavelet_transformed_image', 'curtosis_wavelet_transformed_image', 'entropy_image', 'class']

#Read csv-file
data = pd.read_csv('data_banknote_authentication.csv', names=attribute_names)

#Shuffle data
data = data.sample(frac=1)

#Shows the first 5 rows of the data
data.head()
Out[9]:
variance_wavelet_transformed_image skewness_wavelet_transformed_image curtosis_wavelet_transformed_image entropy_image class
413 3.47690 -0.15314 2.530000 2.44950 0
942 -3.37930 -13.77310 17.927400 -2.03230 1
763 0.39012 -0.14279 -0.031994 0.35084 1
1169 0.98296 3.42260 -3.969200 -1.71160 1
768 -1.75820 2.73970 -2.532300 -2.23400 1

K-Folding

In the main tutorial we learned why K-Folding is useful. So, let's have a look at how K-Folding can be implemented by getting our hands dirty instead of delegating all the work to Sklearn's algorithms. To do that, we first split our data into several parts - in our case into five.

In [10]:
#Shuffle the data (again)
data_shuffled = data.sample(frac=1)
#Get number of rows of our data
row_count = len(data_shuffled)
print(row_count)

#Divide number of rows by five 
sub_row_count = int(row_count/5)
print(sub_row_count)

#Split the shuffled data into five equally sized slices
data_shuffled_split_1 = data_shuffled.iloc[:sub_row_count,:]
data_shuffled_split_2 = data_shuffled.iloc[sub_row_count:2*sub_row_count, :]
data_shuffled_split_3 = data_shuffled.iloc[2*sub_row_count:3*sub_row_count, :]
data_shuffled_split_4 = data_shuffled.iloc[3*sub_row_count:4*sub_row_count, :]
data_shuffled_split_5 = data_shuffled.iloc[4*sub_row_count:5*sub_row_count, :]
1372
274

After dividing our data into 5 parts of 274 rows each (1372 is not divisible by 5, so the last two shuffled rows are simply dropped), we create different combinations of test and training data out of these splits.

In [11]:
# Training data splits: 1-4, test data split: 5
training_data_1 = pd.concat([data_shuffled_split_1,data_shuffled_split_2, data_shuffled_split_3, data_shuffled_split_4])
test_data_1 = data_shuffled_split_5
training_data_1_x = training_data_1.loc[:, data.columns != 'class']
training_data_1_y = training_data_1['class']
test_data_1_x = test_data_1.loc[:, data.columns != 'class']
test_data_1_y = test_data_1['class']

# Training data splits: 2-5, test data split: 1
training_data_2 = pd.concat([data_shuffled_split_2, data_shuffled_split_3, data_shuffled_split_4, data_shuffled_split_5])
test_data_2 = data_shuffled_split_1
training_data_2_x = training_data_2.loc[:, data.columns != 'class']
training_data_2_y = training_data_2['class']
test_data_2_x = test_data_2.loc[:, data.columns != 'class']
test_data_2_y = test_data_2['class']

# Training data splits: 1, 3-5, test data split: 2
training_data_3 = pd.concat([data_shuffled_split_1, data_shuffled_split_3, data_shuffled_split_4, data_shuffled_split_5])
test_data_3 = data_shuffled_split_2
training_data_3_x = training_data_3.loc[:, data.columns != 'class']
training_data_3_y = training_data_3['class']
test_data_3_x = test_data_3.loc[:, data.columns != 'class']
test_data_3_y = test_data_3['class']

# Training data splits: 1,2 4,5 test data split: 3
training_data_4 = pd.concat([data_shuffled_split_1, data_shuffled_split_2, data_shuffled_split_4, data_shuffled_split_5])
test_data_4 = data_shuffled_split_3
training_data_4_x = training_data_4.loc[:, data.columns != 'class']
training_data_4_y = training_data_4['class']
test_data_4_x = test_data_4.loc[:, data.columns != 'class']
test_data_4_y = test_data_4['class']

# Training data splits: 1, 2, 3, 5 test data split: 4
training_data_5 = pd.concat([data_shuffled_split_1, data_shuffled_split_2, data_shuffled_split_3, data_shuffled_split_5])
test_data_5 = data_shuffled_split_4
training_data_5_x = training_data_5.loc[:, data.columns != 'class']
training_data_5_y = training_data_5['class']
test_data_5_x = test_data_5.loc[:, data.columns != 'class']
test_data_5_y = test_data_5['class']
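
By the way, all of the splitting and combining above could also be generated in a loop instead of being typed out by hand. The following sketch shows one possible way to do it; it assumes the data_shuffled DataFrame and the 'class' column from above, and the helper name build_folds is made up for this example. The order of the combinations differs slightly from the manual version, but the idea is the same.

#Sketch: build the k folds and the train/test combinations in a loop
#(assumes data_shuffled and the 'class' column from above; build_folds is a made-up name)
def build_folds(df, k=5):
    fold_size = int(len(df) / k)
    #k equally sized slices of the shuffled data
    folds = [df.iloc[i * fold_size:(i + 1) * fold_size, :] for i in range(k)]
    combinations = []
    for i in range(k):
        #fold i becomes the test data, all other folds form the training data
        test_part = folds[i]
        train_part = pd.concat(folds[:i] + folds[i + 1:])
        combinations.append((train_part.loc[:, train_part.columns != 'class'],
                             train_part['class'],
                             test_part.loc[:, test_part.columns != 'class'],
                             test_part['class']))
    return combinations

#Each entry is a tuple (train_x, train_y, test_x, test_y)
fold_combinations = build_folds(data_shuffled, k=5)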

Great! Now we have 5 different training and test data sets. Let's use them for our Decision Tree! As you know, programmers are lazy and usually do not want to repeat (almost) the same steps several times, as we just did when dividing the data into the different test and training sets above. Therefore, let's create a fancy loop that builds a Decision Tree for each training data set and then tests it with the corresponding test data. To do that, let's first instantiate a Decision Tree Classifier and then make some lists containing the different training and test data.

In [12]:
#Import DecisionTreeClassifier from the Sklearn library
from sklearn.tree import DecisionTreeClassifier

#Create a classifier object 
classifier = DecisionTreeClassifier() 
In [13]:
#create lists with training and test data splits

training_data_x = [training_data_1_x, training_data_2_x, training_data_3_x, training_data_4_x, training_data_5_x]
training_data_y = [training_data_1_y, training_data_2_y, training_data_3_y, training_data_4_y, training_data_5_y]
test_data_x = [test_data_1_x, test_data_2_x, test_data_3_x, test_data_4_x, test_data_5_x]
test_data_y = [test_data_1_y, test_data_2_y, test_data_3_y, test_data_4_y, test_data_5_y]

This fancy loop returns a list with the accuracies of the different combinations of test and training data sets. We then use the arithmetic mean of this list to get a better estimate of our tree's performance.

In [14]:
#import numpy library
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix  


#tests our model with all different training and test data splits
#param: training_data_x: list of all training data with x-variables
#param: training_data_y: list of all training data with y-variables
#param: test_data_x: list of all test data with x-variables
#param: test_data_y: list of all test data with y-variables
#param: classifier: classifier object that is fitted and evaluated on every split
#return: accuracies: list with tested accuracies using different training and test data splits
def get_accuracies_of_k_fold(training_data_x, training_data_y, test_data_x, test_data_y, classifier):
    accuracies = []
    for count in range(0, len(training_data_x)):
        classifier = classifier.fit(training_data_x[count],training_data_y[count])
        y_pred = classifier.predict(test_data_x[count]) 
        conf_matrix = confusion_matrix(test_data_y[count], y_pred)
        accuracy = (conf_matrix[0,0] + conf_matrix[1,1]) /(conf_matrix[0,0]+conf_matrix[0,1]+ conf_matrix[1,0]+conf_matrix[1,1])
        accuracies.append(accuracy)
    return accuracies


accuracies = get_accuracies_of_k_fold(training_data_x, training_data_y, test_data_x, test_data_y, classifier)
accuracy_mean = np.mean(accuracies)

print(round(accuracy_mean,4))
0.9839
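
For comparison, Sklearn ships its own cross-validation helper that does essentially the same thing in a few lines. The sketch below assumes the data DataFrame loaded at the beginning and the DecisionTreeClassifier import from above; because Sklearn constructs the folds itself, the exact score can deviate slightly from ours.

from sklearn.model_selection import cross_val_score

#x-variables and y-variable of the full (shuffled) data set
data_x = data.loc[:, data.columns != 'class']
data_y = data['class']

#5-fold cross-validation; cross_val_score returns one accuracy per fold
scores = cross_val_score(DecisionTreeClassifier(), data_x, data_y, cv=5)
print(round(scores.mean(), 4))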

Parametrization of Decision Trees

While our tree makes awesome predictions here, in other cases the predictions are usually not this good right away. This is often related to overfitting. We did not give our tree any parameters for the maximal depth to which we allow it to grow, nor for the minimal number of samples required to split a node. So let's write a function that tests all kinds of depths and minimal split sizes, builds a tree with each combination of these parameters, and saves the results in a list of Python dictionaries.

In [15]:
#tests which max_depth and which min_samples_split work best as parameters for building our tree
#param: training_data_x: list of all training data with x-variables
#param: training_data_y: list of all training data with y-variables
#param: test_data_x: list of all test data with x-variables
#param: test_data_y: list of all test data with y-variables
#param: min_max_depth: lower bound of the interval of max_depth values to test
#param: max_depth: upper bound (exclusive) of the interval of max_depth values to test
#param: min_samples_split: lower bound of the interval of min_samples_split values to test
#param: max_min_sample_split: upper bound (exclusive) of the interval of min_samples_split values to test
#return: trees: list of dictionaries containing max_depth, min_sample_split and accuracy_mean for every tree that was built
def find_best_model_and_its_accuracies(training_data_x, training_data_y, test_data_x, test_data_y, min_max_depth, max_depth, min_samples_split, max_min_sample_split):    
    trees = []
    if min_samples_split <=1:
        min_samples_split = 2
        
    if min_max_depth <= 0:
        min_max_depth = 1
        
    for count_depth in range (min_max_depth, max_depth):
        for count_split in range (min_samples_split, max_min_sample_split):
            classifier = DecisionTreeClassifier(max_depth = count_depth, min_samples_split = count_split)
            accuracies = get_accuracies_of_k_fold(training_data_x, training_data_y, test_data_x, test_data_y, classifier)
            mean_accuracies = np.mean(accuracies)
            tree_info = {'max_depth':count_depth, 'min_sample_split':count_split, 'accuracy_mean': mean_accuracies}
            trees.append(tree_info)
    return trees

trees = find_best_model_and_its_accuracies(training_data_x, training_data_y, test_data_x, test_data_y, 1, 20, 2, 20)

With this list we can now detect which depth and which minimal split size worked best for our different test and training data sets and compare the result to the accuracy we achieved before. As our model already worked pretty well, the accuracy of the chosen tree only differs slightly from the accuracy of our first tree. We verify this by building a new Decision Tree with the determined best parameters.

In [21]:
#detects which parameters achieved the best accuracy
#param: trees: list of dictionaries for all tested trees
#return: best_tree: dictionary with the parameters that were given to the tree which reached the highest accuracy mean
def get_best_tree(trees):
    accuracy_means = []
    best_tree = None

    for tree in trees:
        accuracy_mean = tree.get('accuracy_mean')
        accuracy_means.append(accuracy_mean)
    max_accuracy_mean = max(accuracy_means)
    
    for tree in trees:
        if tree.get('accuracy_mean') == max_accuracy_mean:
            best_tree = tree
            break
    return best_tree

#dictionary with information about the best parameters for the tree
best_tree = get_best_tree(trees)

print(best_tree)

#Build new classifier object using best parameters
classifier_new = DecisionTreeClassifier(max_depth = best_tree['max_depth'], min_samples_split = best_tree['min_sample_split'])

#list with accuracies with different test and training sets of the new tree with the best parameters
accuracies_new = get_accuracies_of_k_fold(training_data_x, training_data_y, test_data_x, test_data_y, classifier_new)

#arithmetic mean of the list with the accuracies of the new tree with the best parameters
accuracy_best_tree = np.mean(accuracies_new)

print('accuracy best tree: ' + str(round(accuracy_best_tree,4)))
print('old accuracy: ' + str(round(accuracy_mean, 4)))
{'max_depth': 14, 'min_sample_split': 2, 'accuracy_mean': 0.983941605839416}
accuracy best tree: 0.9847
old accuracy: 0.9839
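
As a closing comparison, the parameter search we wrote by hand corresponds roughly to what Sklearn's GridSearchCV does. The sketch below again assumes the data DataFrame from the beginning and mirrors the parameter ranges we tested above; since GridSearchCV builds its own folds, the best parameters and score it reports may differ slightly from ours.

from sklearn.model_selection import GridSearchCV

#x-variables and y-variable of the full data set
data_x = data.loc[:, data.columns != 'class']
data_y = data['class']

#same parameter ranges as in find_best_model_and_its_accuracies above
param_grid = {'max_depth': range(1, 20), 'min_samples_split': range(2, 20)}

#5-fold cross-validated grid search over all parameter combinations
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(data_x, data_y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 4))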

That's it for today! We hope you liked this tutorial - soon there will be more. Hurray!