My First Classifier

From classes to classifiers, I’ve made my first proper step into machine learning and implemented my first ML algorithm in Python: the k-nearest neighbours classifier.

Here we are in the domain of supervised classification, that is, discrete response variables with known outcomes. The general idea behind k-nearest neighbours is to imagine the training data plotted in p-dimensional space, where each point is labelled with its known response. To classify a new point, simply consider the k closest points and choose the majority class among them. Here closeness is defined in terms of the Euclidean distance metric. For more information, check out an episode from the great, albeit very short, series on ML put out by the Google Dev channel, on which this post is loosely based.
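
To make the idea concrete before touching a real dataset, here is a minimal toy sketch using the same scipy helpers the full classifier will use later. The data points are made up purely for illustration: with k = 3, two of the three nearest neighbours belong to class 0, so the new point gets labelled 0.

from scipy.spatial import distance
import scipy.stats as stats

# a made-up toy training set in 2 dimensions: four points and their class labels
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = [0, 0, 1, 1]

new_point = (1.2, 1.5)  # the point we want to classify
k = 3

# Euclidean distance from the new point to every training point
distances = [distance.euclidean(new_point, p) for p in points]

# indices of the k closest training points, then a majority vote over their labels
nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
print(stats.mode([labels[i] for i in nearest]).mode)  # two of the three nearest are class 0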

Before diving into code, consider the task at hand. First off we need a dataset; we will use Fisher’s iris dataset, which comes with sklearn and consists of 4 features (sepal and petal width and length) and 3 classes (3 subspecies of iris). We then need to split this data into training data (to train our classifier) and test data (to test it). The difficult bit is making a classifier that can be trained and then used to predict the outcome of new, unseen data.

To get our data imported and split into train and test we have the following:

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data
y = iris.target

# test_size=30 holds out 30 of the 150 samples for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=30)
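
As a quick sanity check, we can print the shapes of the two splits; with test_size=30 we should have 120 training rows and 30 test rows, each with 4 features.

print(x_train.shape, x_test.shape)  # expect (120, 4) and (30, 4)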

Now to the meat of the problem: building the classifier, which gives us the chance to show off our new classes. Since we want to specify the number of neighbours to consider, we’ll pass the argument k when we define the __init__ method. Just like with the standard classifiers in sklearn, we want the ability to both fit and predict, so we will define a method for each. The __init__ and fit methods are self-explanatory; the crux of this lies in the predict method.

For each new point to be classified in x_test we compute the distance to every point in the training data x_train using distance from scipy.spatial. Finding the indices of the k nearest neighbours, we can then look up their respective classes. Using mode from scipy.stats we find the majority class and use this as our prediction.

from scipy.spatial import distance
import numpy as np
import scipy.stats as stats

class MyKNN:

    def __init__(self, k):
        self.k = k

    def fit(self, x_train, y_train):
        # simply store the training data; k-NN does all its work at prediction time
        self.x = x_train
        self.y = y_train

    def predict(self, x_test):
        predictions = np.array([])
        for i in x_test:
            # distance from the new point to every training point
            distances = np.array([])
            for j in self.x:
                dist = distance.euclidean(i, j)
                distances = np.append(distances, dist)
            # indices of the k nearest neighbours
            knn_index = np.argsort(distances)[:self.k]
            # classes of the k nearest neighbours (list)
            knn_class = [self.y[idx] for idx in knn_index]
            # majority class among the neighbours
            label = stats.mode(knn_class)
            predictions = np.append(predictions, label.mode)
        return predictions

Finally we instantiate our class with k=2 (chosen arbitrarily), call the fit method on our training data and call the predict method on our test data. Importing an accuracy function from sklearn allows us to judge how well our first classifier performs…

clf = MyKNN(2)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

93.3% accuracy: not bad for a first shot, and on that note it is safe to say that I am no longer an ML zero.
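
As a sanity check, we could also pit this against sklearn’s built-in implementation. A minimal sketch, reusing the same train/test split and the same arbitrary k = 2:

from sklearn.neighbors import KNeighborsClassifier

sk_clf = KNeighborsClassifier(n_neighbors=2)
sk_clf.fit(x_train, y_train)
print(accuracy_score(y_test, sk_clf.predict(x_test)))

On a dataset this small the two should land in the same ballpark, which is reassuring for a hand-rolled classifier.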

I’ve recently made a start on Kevin Murphy’s mighty 1000-page book Machine Learning: A Probabilistic Perspective, which so far has proven to be the perfect level of difficulty. As I work my way through it, I hope to implement some more advanced ML algorithms in Python, and I’ll be sure to keep this space updated with anything I hack together.