My First Classifier

From classes to classifiers, I’ve made my first proper step into machine learning and implemented my first ML algorithm in Python: the k-nearest neighbour classifier.

Here we are in the domain of supervised classification, that is, predicting discrete response variables from examples with known outcomes. The general idea behind k-nearest neighbours is to imagine the training data plotted in p-dimensional space, where each point is labelled with its known response. To classify a new point, simply consider the k closest points and choose the majority class among those k. Here closeness is defined in terms of the Euclidean distance metric. For more information check out an episode from a great, albeit very short, series on ML put out by the Google Developers channel, on which this post is loosely based.
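To make the idea concrete before touching the iris data, here is a minimal sketch of the two ingredients, Euclidean distance and the majority vote, on a couple of made-up toy points (nothing to do with the actual task below).

import numpy as np
from collections import Counter

# toy training data: four points in 2D, labelled 0 or 1
train_points = np.array([[0, 0], [1, 0], [5, 5], [6, 5]])
train_labels = np.array([0, 0, 1, 1])
new_point = np.array([0.5, 0.2])

# Euclidean distance from the new point to every training point
dists = np.linalg.norm(train_points - new_point, axis=1)

# take the k = 3 nearest neighbours and vote on their labels
k = 3
nearest = np.argsort(dists)[:k]
print(Counter(train_labels[nearest]).most_common(1)[0][0])  # majority class: 0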

Before diving into code, consider the task at hand. First off we need a dataset. We will use Fisher’s iris dataset, which comes with sklearn and consists of 4 features (sepal and petal width and length) and 3 classes (3 subspecies of iris). We then need to split this data into training data (to train our classifier) and test data (to test it). The difficult bit is making a classifier that can be trained and then used to predict the outcome of new, unseen data.

To get our data imported and split into train and test we have the following:

from sklearn import datasets
from sklearn.model_selection import train_test_split

# load the iris data as a feature matrix x and a label vector y
iris = datasets.load_iris()
x = iris.data
y = iris.target

# an integer test_size holds out that many samples for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=30)
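Since iris has 150 samples of 4 features, a quick (optional) sanity check on the shapes confirms that 30 samples are held out for testing and 120 remain for training.

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)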

Now to the meat of the problem: building the classifier, which gives us the chance to show off our new classes. Since we want to specify the number of neighbours to consider, we’ll pass the argument k when we define the __init__ method. Just like with the standard classifiers in sklearn, we want the ability to both fit and predict, so we will define a method for each. The __init__ and fit methods are self-explanatory; the crux of this lies in the predict method.

For each new point to be classified in x_test we compute the distance to every point in the training data x_train using distance from scipy.spatial. Finding the indices of the k nearest neighbours, we can then look up their respective classes. Using mode from scipy.stats we find the majority class and use this as our prediction.

from scipy.spatial import distance
import numpy as np
import scipy.stats as stats

class MyKNN:

    def __init__(self, k):
        # number of neighbours to consider when voting
        self.k = k

    def fit(self, x_train, y_train):
        # simply memorise the training data
        self.x = x_train
        self.y = y_train

    def predict(self, x_test):
        predictions = np.array([])
        for i in x_test:
            # distance from this test point to every training point
            distances = np.array([])
            for j in self.x:
                dist = distance.euclidean(i, j)
                distances = np.append(distances, dist)
            # indices of the k nearest neighbours
            knn_index = np.argsort(distances)[:self.k]
            # classes of the k nearest neighbours (list)
            knn_class = [self.y[idx] for idx in knn_index]
            # majority class among the k neighbours
            label = stats.mode(knn_class)
            predictions = np.append(predictions, label.mode)
        return predictions
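As an aside, the double loop over test and training points can be vectorised. Here is a minimal sketch of that idea (a hypothetical helper, not the class above): scipy.spatial.distance.cdist computes the full matrix of pairwise distances in one call, and the majority vote is done with np.bincount rather than scipy’s mode, assuming the labels are small non-negative integers as they are for iris.

import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(x_train, y_train, x_test, k=2):
    # dists[i, j] is the distance from test point i to training point j
    dists = cdist(x_test, x_train)
    # indices of the k nearest training points for each test point
    knn_index = np.argsort(dists, axis=1)[:, :k]
    # majority vote over the neighbours' labels
    return np.array([np.bincount(y_train[row]).argmax() for row in knn_index])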

Finally we instantiate our class with k=2 (chosen arbitrarily), call the fit method on our training data and call the predict method on our test data. Importing an accuracy function from sklearn allows us to judge how well our first classifier performs…

clf = MyKNN(2)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

93.3% accuracy: not bad for a first shot, and on that note it is safe to say that I am no longer an ML zero.
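As a sanity check (not part of the original exercise), we could compare against sklearn’s built-in KNeighborsClassifier with the same k; on the same split the two should give very similar accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

sk_clf = KNeighborsClassifier(n_neighbors=2)
sk_clf.fit(x_train, y_train)
print(accuracy_score(y_test, sk_clf.predict(x_test)))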

I’ve recently made a start on Kevin Murphy’s mighty 1000-page book Machine Learning: A Probabilistic Perspective, which so far has proven to be the perfect level of difficulty. As I work my way through it, I hope to implement some more advanced ML algorithms in Python and I’ll be sure to keep this space updated with anything I hack together.

Learning Machine Learning

Inspired by a recent video by John Green, My Information Diet, I’ve decided to revamp my current information intake, which consists of frankly far too much Facebook. As a fan of the medium, I turned my attention to finding some data science/machine learning podcasts and almost instantly came across Partially Derivative, whose latest episode was entitled Learning Machine Learning.

This podcast laid out a syllabus for machine learning built around combining 3 key areas: Theory, Application and Immersion, before going on to recommend a decent number of resources. The books are expensive; however, a large portion of them appear to be available online for free in PDF form. Here is a selection of the ones mentioned in the podcast.

Theory

Application

Immersion

I feel like this will provide a pretty good grounding in the field and gives a really decent structure to what I’ll be looking at in the not too distant future.

Hello World

Hi all, my name’s Alex and I’m a final year Mathematics student at Leeds uni in the UK. During my degree I have started programming, mostly in Python (a little in R), and would say I know my way around the basics fairly well (classes are still sorcery to me at the moment), but my knowledge mainly comes out of a necessity to do maths. I’ve done a couple of larger projects, including my dissertation/final year project on network dynamics, which involved implementing simple algorithms to model opinion dynamics. The other project was a summer research project about random forests and decision trees and relied heavily on the machine learning library scikit-learn, which allowed me to jump right into machine learning without understanding all that much. A combination of this and watching endless interviews with Demis Hassabis of DeepMind has inspired me to crack on and dive deeper into machine learning.

The intention of this blog is to share and document the process of learning more programming and machine learning as I go from Zero to Machine Learning Hero (lame attempt, originality is a work in progress). Hopefully this will provide me with motivation for a developing hobby and, at a later date, even be a source of inspiration and information for others. With this in mind I’ll provide links to resources I have found useful and (maybe) explanations of new things I learn, in an attempt to solidify my own understanding. I’m currently relying on the books Introduction to/Elements of Statistical Learning (ISL/ESL) as well as the Google Developers machine learning series.

In terms of what can be expected of this space, some short-term goals I want to tackle include:
  • Python classes
  • My first ‘home grown’ machine learning algorithm, i.e. not using the inbuilt classifiers of sklearn
  • Wade through ISL/ESL

That concludes the obligatory first post; time to get started on classes, and the next post will no doubt be about that.