Writing a Machine Learning Classifier: K-Nearest Neighbors

Machine learning is a subset of artificial intelligence. Despite the imposing name, machine learning simply refers to developing methods that let computers learn from data and make predictions.

Imagine you wish to write a program that can identify the species of a flower based on a few measurements. You could write a series of if-statements that guide a computer to classify the flower’s species. However, there are two key issues with this approach.

First, if a human explicitly writes the rules by which the computer classifies the flower’s species, bias is likely to creep in, because no person can fully account for all of the data needed to classify a flower. Second, whenever new data is introduced, the rules (if-statements) must be rewritten, which takes valuable time.

For these reasons, a new solution is needed. This solution must adapt to new data, require no explicit writing of rules, and be computationally efficient. In other words, we need a program that can learn.

K-Nearest Neighbors

When discussing machine learning, there is a myriad of methods and models to choose from. Some of these models overlap with classical statistics, such as forms of regression, while others loosely mimic the structure of the human brain using artificial neurons.

To solve our classification problem, we will use a model called K-Nearest Neighbors (KNN). As the name suggests, this model assumes that a new data point is likely to be of the same type as its nearest already-classified neighbors. The classifier written here uses the simplest case, where only the single nearest neighbor is considered (K = 1).

A visual representation of a K-Nearest Neighbor Classifier

In the example above, the x-axis denotes a flower’s petal width while the y-axis denotes its petal length. You can see that blue flowers have smaller petal lengths but larger petal widths than red flowers. Let’s say you add a new point (shown in yellow). What type of flower is the yellow point? Red or blue?

According to the model, it is a red flower, because the closest already-classified data point in the plot is red.
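
The same idea can be sketched in a few lines of plain Python. This is only an illustration: the measurements, the `labeled_points` list, and the `nearest_label` helper are made up for this example and are not part of the Iris data or the classifier written below.

```python
import math

# Hypothetical (petal width, petal length) measurements with known colors.
# Blue flowers are wide and short; red flowers are narrow and long.
labeled_points = [
    ((2.0, 1.5), "blue"),
    ((2.2, 1.3), "blue"),
    ((0.5, 5.5), "red"),
    ((0.4, 5.1), "red"),
]

def nearest_label(new_point, labeled):
    """Return the label of the already-classified point closest to new_point."""
    closest = min(labeled, key=lambda item: math.dist(new_point, item[0]))
    return closest[1]

yellow = (0.8, 4.6)  # the new, unclassified flower
print(nearest_label(yellow, labeled_points))
```

Here the yellow point sits nearest to a red point, so it is labeled red.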

Writing a KNN Classifier

Using Python 3 and Jupyter Notebooks, I have written my own KNN Classifier. While pre-made KNN classifiers are widely available, writing your own provides greater knowledge of how the model works. I begin by importing the necessary Python packages for this program.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%xmode Minimal
%matplotlib inline

Once the packages are imported, I load a famous machine learning dataset called Iris. It includes four columns which provide measurements about 150 Iris flowers for the model to learn and a fifth column that includes the classification of the flower.

path = 'iris.csv'
testing_size = .2
data = pd.read_csv(path, header = None)

From here, I determine the classifications stored as text in the Pandas DataFrame. In the following code, I create a dictionary of the unique values with their associated number and apply the classification to the DataFrame column.

classes = pd.Series(data.iloc[:,-1].unique())
i = 1
dictionary = {}
for item in classes:
    dictionary.update( {item : i} )
    i = i + 1
data = data.replace(dictionary)

Next, I convert the DataFrame into a NumPy array and randomly shuffle the array. I also slice the resulting array into 80% training data (for the model to learn) and 20% testing data (to test how accurate the model is).

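array = np.array(data)
np.random.shuffle(array)  # shuffle the rows in place so the split is random
test_num = round((np.size(array,0))*testing_size)
test = array[:test_num,:]   # first 20% of the rows, held out for testing
train = array[test_num:,:]  # remaining 80% of the rows, used for training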
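
If repeatable results are wanted, the shuffle and split could be wrapped in a small helper with a fixed seed. The following is a minimal sketch under that assumption; the function name `shuffle_split`, its `seed` parameter, and the `demo` array are hypothetical and not part of the original notebook.

```python
import numpy as np

def shuffle_split(array, testing_size=0.2, seed=0):
    """Shuffle the rows reproducibly, then split them into (test, train)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(array)  # shuffled copy; rows stay intact
    test_num = round(shuffled.shape[0] * testing_size)
    return shuffled[:test_num], shuffled[test_num:]

# Stand-in data: 10 rows, 2 columns.
demo = np.arange(20).reshape(10, 2)
test_rows, train_rows = shuffle_split(demo)
print(test_rows.shape, train_rows.shape)
```

Calling the helper twice with the same seed yields the same split, which makes accuracy figures easier to compare between runs.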

After preparing the data, I use NumPy array broadcasting to compute the distance from each testing data point to every training data point. I then use NumPy functions to locate the index of the closest training point for each testing point and record its classification as the prediction.

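input_array = test[:,:-1]
x = []
for row in input_array:
    # Euclidean distance from this test row to every training row (broadcast)
    distances = (((row - train[:,:-1])**2).sum(axis=1))**.5
    min_index = np.argmin(distances)
    classification = train[min_index,-1]
    x.append(classification)  # keep the nearest neighbor's class
predict = np.array(x)[:,np.newaxis]  # column vector of predictions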
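
The loop above looks only at the single closest neighbor. As a hedged sketch of how this might extend to K neighbors, the function below takes a majority vote among the `k` closest training rows, using Euclidean distance and broadcasting over all test rows at once. The name `knn_predict`, the tie-breaking rule, and the toy arrays are assumptions made for illustration, not the code used in this post.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Predict each test row's label by majority vote among its k nearest training rows."""
    # Broadcasting: (n_test, 1, n_features) - (1, n_train, n_features)
    diffs = test_X[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))   # shape (n_test, n_train)
    nearest = np.argsort(distances, axis=1)[:, :k]  # indices of the k closest rows
    votes = train_y[nearest]                        # shape (n_test, k)
    predictions = []
    for row in votes:
        labels, counts = np.unique(row, return_counts=True)
        predictions.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(predictions)

# Toy data: two well-separated groups labeled 1 and 2.
train_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
train_y = np.array([1, 1, 1, 2, 2, 2])
test_X = np.array([[1.1, 1.0], [5.0, 4.9]])
print(knn_predict(train_X, train_y, test_X, k=3))
```

With `k=1` this reduces to a plain nearest-neighbor lookup; larger values of `k` make the prediction less sensitive to a single mislabeled or unusual training point.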

Finally, I combine the predicted classification with the actual classification of the testing data to determine the accuracy of the model.

output = np.hstack((test,predict))
correct = np.count_nonzero(output[:,-1] == output[:,-2])
total = np.size(output[:,-1])
accuracy = round(((correct/total)*100),2)

The classifier takes about 1 millisecond on average to run through the data, and in my runs the model was at least 90% accurate, with the exact figure depending on how the NumPy array was shuffled.

While the model developed here is not nearly as optimized as Scikit-Learn (especially for larger datasets), it does provide insight as to how a relatively simple classification model is developed and tested. More importantly, it reveals that machine learning, while very clever, is nothing more than mathematics.

National Planning Conference 2018: My Experience

In the fall of 2017, I was presented with the opportunity to complete a major project in lieu of several class assignments.

After personally witnessing the decline of several shopping malls in north Dallas, I wanted to focus on how mixed-use developments can help revitalize areas previously occupied by shopping malls.

I selected Collin Creek Mall in east Plano, Texas for my project. This mall, built in the early 1980s, has lost foot traffic and shop leases in recent years.

Using the existing property lines, I designed a mixed-use development that includes commercial and residential spaces. The following is an aerial view of my design:

An aerial view of the proposed development.

At the end of the semester, I presented the design to my classmates. It was around this time that my professor suggested that I submit my project to the 2018 National Planning Conference.

A closer look at the River Side Community.

I submitted the project and was accepted as the only undergraduate student from Texas A&M to present at the conference in New Orleans! The school also agreed to reimburse me for my conference expenses.

By now, it was February and I had until April of 2018 to prepare my presentation. By April, I had my poster printed and ready for the conference.

When I arrived at the conference, vendors and other speakers began to set up their booths. I found a spot for my poster among graduate students from Texas A&M and other schools.

Presenting my poster at NPC18.

I presented my poster to economists, urban planners, data scientists, and real estate professionals throughout the day. I spent my time answering questions about why I chose Collin Creek Mall, the software I used to create the models, and different aspects of the design.

Overall, presenting at NPC18 was a great experience. I had the opportunity to speak with well-established professionals and share my work from the past two semesters, an invaluable exercise.