American Wealth Moves North and West

At first glance, you may think this title is referring to northwestern US states like Oregon or Idaho. While there certainly are wealthy areas in the northwestern US, I am actually referring to which parts of a given city are wealthy.

After traveling across and living in multiple parts of the United States, I have noticed that cities tend to be wealthier on their northern halves. Until now, this was just conjecture, so I used publicly available census tract data to investigate my suspicion.


Building the Visual

First, I obtained data from various public sources. This includes census tract shapefiles, income data, and census tract to county and MSA crosswalks.

I then selected a range of MSAs to analyze. In all I looked at Atlanta, Austin, Boston, Chicago, Dallas, Denver, Houston, Indianapolis, Kansas City, Las Vegas, Los Angeles, Miami, Milwaukee, Minneapolis, Nashville, New Orleans, New York, Oklahoma City, Orlando, Philadelphia, Phoenix, Portland, Salt Lake City, San Antonio, San Francisco, Seattle, Tampa, and Washington DC.

From there, I standardized the latitude and longitude of each MSA such that the most southwestern point in an MSA would have a coordinate of (0,0) while the most northeastern point would have a coordinate of (1,1). This controls for physical size differences between MSAs.

Lastly, I scaled the income of each census tract such that the tract with the highest income in an MSA has an income value of 1 and the lowest income tract has a value of 0. This also controls for wealth differences between MSAs.
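As a rough sketch of the idea, both standardizations are simple min-max scalings within each MSA. The snippet below uses a toy pandas DataFrame with hypothetical column names (msa, lon, lat, income); the original analysis may have been organized differently:

import pandas as pd

# Toy stand-in for the census tract data (hypothetical column names)
tracts = pd.DataFrame({
    "msa":    ["Austin", "Austin", "Austin", "Denver", "Denver"],
    "lon":    [-97.9, -97.7, -97.5, -105.1, -104.8],
    "lat":    [30.1, 30.3, 30.5, 39.6, 39.9],
    "income": [52000, 61000, 90000, 58000, 81000],
})

def min_max(s: pd.Series) -> pd.Series:
    """Scale a series so its minimum maps to 0 and its maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())

# Scale coordinates and income within each MSA: the most southwestern point
# becomes (0, 0), the most northeastern point (1, 1), and income runs 0 to 1.
for col in ["lon", "lat", "income"]:
    tracts[col] = tracts.groupby("msa")[col].transform(min_max)

print(tracts)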

I used this dataset to layer all of the MSA data to create a supercity that represents all of the individual MSAs collectively.

And here is the result! The closer to gold a given tract is, the higher its income. Conversely, the closer to dark blue a tract is, the lower its income. The black dot represents the city center. I observe a fairly clear distinction between the northwest and southeast of US cities.

There are, of course, exceptions to the rule. We can see gold census tracts in the south of some MSAs though wealth generally appears to be concentrated in the northwest.


A Simple Explanatory Model

To add some validity to these findings I estimated a very simple linear model which estimates a census tract’s income using its relative position to the city center. Here are the results:

Term                     Coefficient (Converted to USD)
Intercept                $84,288
Longitude (West/East)    -$6,963
Latitude (North/South)   $7,674

Results of the income prediction model.

The way to read these coefficients is as follows. At the city center, census tracts have, on average, a median household income of $84,288. As you move east, median household income falls (hence the negative coefficient for Longitude), and as you move north, income rises (hence the positive coefficient for Latitude).

In other words, northwestern tracts have median household incomes approximately $14,000 higher than the city center, or roughly $28,000 higher than their southeastern counterparts.
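For readers curious how such a model can be fit, here is a minimal sketch using scikit-learn on made-up data; it assumes tract coordinates have been re-centered so that (0, 0) is the city center, and it is not the exact specification used above:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up scaled tract data, re-centered so (0, 0) is the city center
tracts = pd.DataFrame({
    "lon":    [-0.4, -0.1, 0.0, 0.2, 0.5],
    "lat":    [ 0.3, -0.2, 0.0, 0.4, -0.5],
    "income": [95000, 78000, 84000, 88000, 70000],
})

model = LinearRegression().fit(tracts[["lon", "lat"]], tracts["income"])

print(model.intercept_)  # predicted income at the city center
print(model.coef_)       # change in income per unit move east and north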

Obviously, this model is oversimplified and would not be a good predictor of tract income given the huge variety of incomes across MSAs in the US, but it does illustrate an interesting point about income vs. tract position in an MSA.


Closing Thoughts

Before closing out, I wanted to draw attention to a few specific MSAs where this effect stands out. I would argue that the northwest-versus-southeast divide is especially pronounced in the following six cities, above all Washington DC.

I hope this high level summary provides some interesting food for thought about the differences in income across US cities.

Kahan’s Summation Algorithm: Computing Better Sums

Suppose you have the following list of numbers in Python and you would like to compute the sum. You use the sum() function and expect it to return 0.3. Yet, when you run the code the console returns a value very slightly above 0.3:

numbers = [0.1, 0.1, 0.1]

sum(numbers)

0.30000000000000004

You can round this number, of course, but it raises the question of why the exact sum was not returned in the first place. Enter the IEEE 754 floating point standard.

Floating Point Storage

The double type is a 64 binary digit (bit) storage format that includes 1 sign bit (which determines whether the number is positive or negative), a 53 bit significand (only 52 bits are stored explicitly; the leading bit is implicit for normal values), and an 11 bit exponent.

An 11 bit exponent means the smallest positive normal number that can be stored is 2^-1022. Additionally, the gap between 1 and the next larger representable number is 2^-52, known as machine epsilon, which sets the scale of rounding error. Because this is a binary representation, many numbers that can be written exactly in base 10 must be approximated when converted to binary.
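These limits are easy to inspect directly from Python, which exposes the parameters of the double type through sys.float_info:

import sys

print(sys.float_info.mant_dig)  # 53 significand bits
print(sys.float_info.min)       # 2.2250738585072014e-308, i.e. 2**-1022
print(sys.float_info.epsilon)   # 2.220446049250313e-16, i.e. 2**-52 (machine epsilon)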

Going back to our example above, 0.1 is a value that must be rounded for storage in this format. This is because the binary expansion of 0.1 repeats forever:

0.000110011001100110011...
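You can see the resulting rounding directly by asking Python for the exact decimal value that is actually stored for 0.1:

from decimal import Decimal

print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())   # 0x1.999999999999ap-4, the rounded binary significand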

There are methods to store such values exactly (Python's decimal module, for example), but this comes at the cost of computation speed. What if we want to keep the speed of 64 bit floating point arithmetic but reduce our error, especially when summing long series of numbers?

The Algorithm

Enter Kahan’s Summation Algorithm. Developed by William Kahan, this summation methodology allows for more accurate summation using the double storage format. Here is a simple Python implementation:

def kahan_sum(x):
   
  sum = 0.0; c = 0.0
 
  for i in x:
      
    y = i - c
    t = sum + y
    c = t - sum - y
    sum = t
 
  return sum

Okay, so this looks pretty simple. But what does each piece mean? The first two lines establish a function in Python while setting the starting sum and starting error to 0.0:

def kahan_sum(x):
   
  sum = 0.0; c = 0.0

The next few lines form the for loop that iterates over each number in the list. First, any error carried over from the previous iteration is subtracted from the current value.

y = i - c

Second, this error-adjusted value is added to the running total.

t = sum + y

Third, error from this new addition is determined and the new total is assigned. This repeats until there are no more numbers.

c = t - sum - y
sum = t
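
As a quick check, we can compare this implementation against Python's built-in sum() and math.fsum() (a correctly rounded reference) on ten copies of 0.1:

import math

numbers = [0.1] * 10

print(sum(numbers))        # 0.9999999999999999 with naive left-to-right addition
print(kahan_sum(numbers))  # 1.0
print(math.fsum(numbers))  # 1.0, the correctly rounded result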

A Practical Example

Okay, so the code is pretty simple but how does this work in practice? Suppose we have a list of two numbers:

[1.0, 1.0]

Step 1

The sum and error terms are set to 0.0 when the algorithm is started. The first step of each iteration is to take the current value and subtract any error from the previous iteration. Because the starting error is 0.0, we subtract 0.0 from the first value.

1.0 - 0.0 = 1.0

Step 2

Next, we add the result of the previous operation to the running total. Again, the initial total is 0.0, so we just add 0.0 to the value from Step 1 (1.0). Now suppose, purely for illustration, that the computer makes an exaggerated rounding error of 0.1 here. We can handle this error in the next steps.

0.0 + 1.0 ~ 1.1

Step 3

In this step we determine the error from Step 2. We take the sum from Step 2 (1.1), subtract the previous total (0.0), and subtract the value from Step 1 (1.0). This leaves us with the approximate error.

1.1 - 0.0 - 1.0 ~ 0.1

Step 4

Finally, we record the current total for the next iteration!

1.1

And Repeat!

Now we repeat Steps 1, 2, 3, and 4 for each additional number. The difference this time is that we have non-zero values for the error and total terms. First, we subtract the error term from the last iteration from the new value:

1.0 - 0.1 = 0.9

Next, add the new value to the previous total:

1.1 + 0.9 = 2.0

Next, take the sum from the last step and subtract the previous iteration’s total and the value from the first step to estimate any error. In this case there is no error so we record a value of 0.0 for the error going into the next iteration:

2.0 - 1.1 - 0.9 = 0.0

Finally, return the sum. We can see that even though the computer made an error of 0.1, the algorithm corrected itself and returned the correct value:

2.0

Final Thoughts

Kahan’s method of summation strikes a balance between the speed of floating point arithmetic and accuracy. Hopefully this walkthrough makes the algorithm more approachable.

Association Rule Mining in R

Association rule mining is the process of determining conditional probabilities within events that contain items or characteristics. Events can range from tweets, to grocery store receipts, to credit card applications.

Items within these events should also not be unique to each event. For example, words are repeated across tweets, multiple customers will buy the same items at the grocery store, and credit card applicants will share specific characteristics.

For all of these applications our goal is to estimate the probability that an event will possess item B given that it has item A. This probability is also called the confidence.

In the grocery example, we might say that we are 23% confident that a customer will purchase rice (item B) given that they are purchasing chicken (item A). We can use historical transactions (events) to estimate confidence.
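To make the estimate concrete, confidence is computed as:

Confidence(A → B) = (number of events containing both A and B) / (number of events containing A)

For instance, if 100 historical transactions contain chicken and 23 of those also contain rice, the estimated confidence of chicken → rice is 23 / 100 = 23%.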

Now for a practical implementation using the tidyverse in R! I am using a groceries dataset from Georgia Tech. This dataset contains rows with items separated by commas.

receipt
citrus fruit, semi-finished bread
ready soups, margarine
One transaction per row with items comma separated.

Because each event contains a different number of items, I read the file using readLines() and reshape it into a longer format. The groceries column contains the item name while transaction contains the transaction ID.

link <- "https://cse6040.gatech.edu/datasets/groceries.csv"

groceries <- readLines(link)

# Create long form version of data

groceries_long <- 
  data.frame(groceries) %>%
  dplyr::mutate(
    transaction = dplyr::row_number()
  ) %>%
  tidyr::separate_rows(
    groceries, sep = ","
  )
groceries             transaction
citrus fruit          1
semi-finished bread   1
tropical fruit        2

Long form data with one item per row and a transaction ID.

With our data in the proper format we can develop two functions. The first function takes a vector of items and returns a vector of comma separated combinations as (A,B) and (B,A).

comb_vec <- function(items) {
  
  p <- t(combn(items, 2))
  
  c(paste0(p[, 1], ",", p[, 2]), paste0(p[, 2], ",", p[, 1]))

}

For example, giving this function c("A", "B", "C") would return c("A,B", "A,C", "B,C", "B,A", "C,A", "C,B"). This is because we want to determine the probabilities of A given B and B given A.

Our final function performs the data mining. The first argument called data takes in the data frame of events and items. The last two arguments item_col and event_id tell the function which columns refer to the items and the event identifier respectively.

pair_assoc <- function(data, item_col, event_id, item_min = 1L) {
  
  # Count all items
  
  item_count <- dplyr::count(data, !!sym(item_col), name = "A Count")
  
  data %>%
    dplyr::group_by( # Group by event identifier
      !!sym(event_id)
    ) %>%
    dplyr::filter( # Keep only events that contain at least two items
      length(!!sym(item_col)) > 1
    ) %>%
    dplyr::summarise( # Create combinations for each event
      comb = comb_vec(!!sym(item_col))
    ) %>%
    dplyr::ungroup( # Ungroup before counting combinations
    ) %>%
    dplyr::count( # Count combinations across all events
      comb, name = "A B Count"
    ) %>%
    tidyr::separate( # Separate combinations into two columns
      col = comb,
      into = c("A","B"), sep = ","
    ) %>%
    dplyr::left_join( # Join counts of item A from item_count
      y = item_count,
      by = c("A" = item_col)
    ) %>%
    dplyr::mutate( # Compute confidence P(B given A)
      Confidence = `A B Count` / `A Count`
    ) %>%
    dplyr::arrange( # Descend by confidence
      desc(Confidence)
    )
  
}

This function works in two stages. First, it determines the count of all individual items in the data set. In the example with groceries, this might be the counts of transactions with rice, beans, etc.

groceries       A Count
baking powder   174
berries         327

Counts of individual items serve as the denominator in the confidence computation.

The second stage uses the comb_vec() function to determine all item combinations within each event. This stage only returns combinations that actually co-occur, i.e., where the confidence is greater than 0%.

Finally, the function left joins the item counts to the combination counts and computes the confidence values. I call the function and return the result, filtering to combinations with a confidence of 50% or more where item A appears in at least 10 transactions.

groceries_long %>%
  pair_assoc(
    item_col = "groceries", 
    event_id = "transaction"
  ) %>%
  dplyr::filter(
    `A Count` >= 10,
    Confidence >= 0.5
  )

Here we can see the head of the results table ordered by confidence from highest to lowest. We observe that the confidence of honey → whole milk is 73%! In other words, 73% of the transactions that contain honey also contain whole milk.

A               B                  A B Count   A Count   Confidence
honey           whole milk         11          15        0.733
frozen fruits   other vegetables   8           12        0.667
cereals         whole milk         36          56        0.643
rice            whole milk         46          75        0.613

Head of the results table.

Association rule mining is a fairly simple and easy to interpret technique to help draw relationships between items and events in a data set.

The Logistic Map: Visualizing Chaos in R

In the 1970s, professor Robert May became interested in the relationship between complexity and stability in animal populations. He noted that even simple equations used to model populations over time can lead to chaotic outcomes. The most famous of these equations is as follows:

xₙ₊₁ = r·xₙ·(1 − xₙ)

xₙ is a number between 0 and 1 that represents the ratio of the existing population to the maximum possible population, and r is a value between 0 and 4 that indicates the growth rate. xₙ is multiplied by r to simulate growth, while the (1 − xₙ) term represents deaths in the population.

Let's assume a population of animals is at 50% of the maximum population for a given area, so xₙ starts at 0.5. Let's also assume a growth rate of 75%, making r equal to 0.75. After the value xₙ₊₁ is computed, we use that new value as xₙ in the next iteration and continue to use an r value of 0.75. We can visualize how xₙ₊₁ changes over time.
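Plugging these numbers in gives the first couple of updates, starting from x₀ = 0.5:

x₁ = 0.75 × 0.5 × (1 − 0.5) = 0.1875
x₂ = 0.75 × 0.1875 × (1 − 0.1875) ≈ 0.114

Each new value feeds back into the right-hand side, and with r below 1 the population keeps shrinking toward zero.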

Visualizing the population with an r value of 50% and a starting population of 50%.

Within 20 iterations, the population dies off. Let's rerun the simulation with an r value greater than 1.

Visualizing the population with an r value of 1.25 and a starting population of 50%.

Notice how the population stabilizes at 20% of the area capacity. When the r value is higher than 3, the population will begin oscillating between multiple values.

Visualizing the population with an r value of 3 and a starting population of 50%.

Expanding beyond an r value of 3.54409 yields rapid changes in oscillation and reveals chaotic behavior.

Visualizing the population with an r value of 3.7 and a starting population of 50%.

Extremely minor changes in the r value yield vastly different distributions of population oscillations. Rather than experiment with different r values, we can visualize the distribution of xₙ₊₁ values for a range of r values using the R programming language.

Let's start by building a function in R that returns the first 1000 iterations of xₙ₊₁ for a given r value.

logistic_sim <- function(lamda, starting_x = 0.5) { # default matches the 50% starting population
  
  vec <- c(starting_x)
  iter <- seq(1,1000,1)
  
  for (i in iter) { 
    vec[(i + 1)] <- vec[i] * lamda * (1 - vec[i])
  }
  
  vec <- vec[1:(length(vec) - 1)]
  
  data.frame(vals = vec, lamda = lamda, iter = iter)

}

This function returns a dataframe with three columns: the iteration number, the r value (lamda) used, and the xₙ₊₁ value computed for that iteration.

Now we need to iterate this function over a range of r values. Using purrr::map_dfr we can row bind each iteration of r together into a final dataframe.

build_data <- function(min, max) {
  
  step <- (max - min) / 400
  
  purrr::map_dfr(
    seq(min,max,step),
    logistic_sim
  )
  
}

The min argument refers to the lower limit of r and max to the upper limit. The function returns a dataframe of approximately 400,000 rows: 1,000 iterations for each of the roughly 400 r values between the lower and upper bounds. It produces all 400,000 rows in less than a quarter of a second.

With the dataframe of values assembled, we can visualize the distribution of values using ggplot.

data <- build_data(0, 4) # Assemble the dataframe for r values from 0 to 4

data %>%
  dplyr::filter( # Drop the first 50 iterations so the simulation can stabilize
    iter > 50
  ) %>%
  ggplot(
    aes(x = lamda, y = vals, color = lamda)
  ) +
  geom_point(
    size = .5
  ) +
  labs(
    x = "Growth Rate",
    y = "Population Capacity",
    title = "Testing Logistic Growth in Chaos"
  ) +
  scale_x_continuous(
    labels = scales::percent
  ) +
  scale_y_continuous(
    labels = scales::percent
  ) +
  theme_minimal(
  ) +
  theme(
    legend.position = "none",
    text = element_text(size = 25)
  )
Visualizing the distribution of 400 r values between 0 and 4 for 1000 iterations.

Notice how r values of less than 1 cause the population to die out. Between 1 and just under 3, the population remains relatively stable. At around 3, the population begins oscillating between two points. Beyond an r of 3.54409, chaos ensues. It becomes extremely difficult to predict the value of xₙ₊₁ for a given iteration with an r value above 3.54409. So difficult, in fact, that this simple deterministic equation was used as an early random number generator.

So what are the practical applications for this? Representations of chaos (or systems that yield unpredictable results and are sensitive to starting conditions) can be seen across many industries and fields of study. In finance, for example, intra-day security prices have been described as a random walk – extremely difficult to predict. While long term outlooks may show seasonality, chaos theory can help model the extremely chaotic and unpredictable nature of stock prices.

Zillow: Taking a Look at the Real Estate Marketing Giant

Zillow provides prospective homebuyers and renters with a toolkit to search for their next residence. Users can filter by price, number of bedrooms/bathrooms, home type, square footage, and lot size, and gain access to Zillow’s famous Zestimate – all for free. Even better, homeowners, agents, and rental property managers can list residences for free as well.

How is it, with so many free services, that Zillow earns over $1 billion in annual revenue and employs thousands of people? As is the case with many online media platforms, it is primarily through the sale of ads.

Much like ads on Google search pages, Zillow allows advertisers to reach specific audiences through advertisements on their platform. Advertisers range from rental agencies to interior designers – all people who need to reach subsets of prospective renters and buyers.

An ad from an internet service provider in Texas

Zillow maintains other revenue streams too. In combination with ads on Zillow’s website, advertisers can also run ads on other sites in the Zillow Rental Network. This network of websites includes Trulia, Hotpads, and even AOL Real Estate.

Beyond advertising, Zillow has started Zillow Offers, a program that provides cash offers to homeowners who wish to expedite the sale of their home. Zillow connects sellers with a representative who helps them work through the sale and closing of their home.

Zillow also offers real estate agents the opportunity to become a Premier Agent. Premier Agents pay Zillow for additional services that help run their agency. Premier Agents can receive leads from Zillow’s network of sites and Facebook. They can also gain access to a CRM and even a website for their agency.

Zillow is owned by the Zillow Group, a company which oversees Zillow, Trulia, Hotpads, Zillow Home Loans, and a number of other organizations. With such a broad network of websites, services, and users, it’s no wonder that Zillow attracts people from all over the housing industry.

NAHB: Data Staging in Python

The National Association of Home Builders or NAHB is an organization that represents housing interests within the United States. This includes affordable housing, home building techniques and methods, and the promotion of home ownership.

In conjunction with these goals and functions, NAHB releases data about housing in the United States. The data tracks various metrics including Median Income, Interest Rates, and NAHB’s very own Housing Opportunity Index or HOI.

NAHB calculates the HOI by comparing home prices to income in a given area. For example, if income rises in Dallas-Ft. Worth and home prices remain the same, then HOI increases. Alternatively, if income remains the same while home prices rise, HOI falls.

I wanted to visualize this dataset in Tableau; however, when I opened the spreadsheet, it was in a format that Tableau cannot easily work with.

The raw NAHB spreadsheet.

While the format is acceptable for basic spreadsheet analysis, it lacks the proper long-form layout required for analysis in Tableau. Using Python, I wanted to convert this spreadsheet into a CSV with the following features:

  • Variable name column headers (Median Income, FIP, etc.)
  • One row per quarter per metropolitan statistical area
  • Proper datetime and numerical formats

My Python script begins by importing the proper dependencies. In this instance, I need pandas, numpy, and datetime.

import pandas as pd
import numpy as np
import datetime as dt

Next, I use Pandas to read the Excel file, remove unneeded rows, and melt the pivoted table format. From there, I rename two columns for better readability.

df = pd.read_excel('housingdata.xls')
df = df.loc[df.NAME.notnull()]

df1 = df[~df['flag'].isin([1,8])]

dfmelt = pd.melt(df1, id_vars=['msa_fip','NAME','flag'])

dfmelt.rename(columns={'variable':'date',
                       'NAME':'variable'},
              inplace=True)

I use the melted dataset to create a new column called index which is used to pivot the data in a format that is readable by Tableau.

dfmelt['msa_fip'] = dfmelt['msa_fip'].apply(str)

dfmelt['index'] = dfmelt['msa_fip'] + dfmelt['date']

dfpivot = dfmelt.pivot(index='index', 
                       columns='variable', 
                       values='value').reset_index()

Next, I slice the existing index column into FIP, Quarter, and Year columns and use them to build a proper datetime column.

dfpivot['FIP'] = dfpivot['index'].str[:5]
dfpivot['Quarter'] = dfpivot['index'].str[6:7]
dfpivot['Year'] = dfpivot['index'].str[8:]

dfpivot['Year'] = dfpivot['Year'].apply(int)

dfpivot['Year'] = np.where(dfpivot['Year']>80,
                           1900 + dfpivot['Year'],
                           2000 + dfpivot['Year'])

dfpivot['Year'] = dfpivot['Year'].apply(str)

dfpivot['Quarter'] = dfpivot['Quarter'].replace({'1':'01-01',
                                              '2':'03-01',
                                              '3':'06-01',
                                              '4':'09-01'})

dfpivot['Date'] = (dfpivot['Year'] + '-' + dfpivot['Quarter'])

dfpivot['Date'] = pd.to_datetime(dfpivot['Date'])

Lastly, I collect the unique names of all metropolitan statistical areas from the initial dataframe. I left join this new smaller dataset called names to the newly formatted dataset.

names = df[['NAME','msa_fip']].loc[df.flag == 1].drop_duplicates()
names['msa_fip'] = names['msa_fip'].apply(str)
dfpivot['FIP'] = dfpivot['FIP'].apply(str)
names.set_index('msa_fip', inplace=True)
dfpivot.set_index('FIP', inplace=True)

output = (dfpivot.join(names,
                       how='left')).reset_index()

output['FIP'] = output['index']
output.drop(columns='index', inplace=True)

This final dataframe can now be saved as a CSV or Excel file for further analysis in Tableau.

The NAHB data after staging it in Python.

Now that the data has been staged and saved as a CSV, we can conduct deeper analysis. Using Tableau Public, I created two visualizations about the Housing Opportunity Index.

The first visualization highlights changes to the index across the country on average for all metropolitan-statistical-areas.

A time series plot of the HOI over time.

The second visualization is a scatterplot that compares the median home price in a metropolitan-statistical-area to the HOI in that area. As one may suspect, home prices inversely correlate with housing opportunity. In other words, greater affordability improves housing opportunity.

A scatterplot comparing the HOI to median home prices.

Implementing Fuzzy Matching in Python

Text is all around us; essays, articles, legal documents, text messages, and news headlines are consistently present in our daily lives. This abundance of text provides ample opportunities to analyze unstructured data.


Imagine you are playing a game where someone hands you an index card with the misspelled name of a popular musician. In addition, you have a book containing correctly spelled names of popular musicians. The goal is for you to return the correct spelling of the misspelled name.

In this example, suppose someone hands you a card with “Billie Jole” written on it. You quickly open the book of musicians, find names beginning with the letter B, and find the name “Billy Joel.”

As a human, this was easy for you to complete, but what if you wanted to automate this task? This can be done using Fuzzy Logic, or more specifically, the Levenshtein distance.


The Levenshtein distance considers two pieces of text and determines the minimum number of changes required to convert one string into another. You can utilize this logic to find the closest match to any given piece of text.

I am going to focus on implementing the mechanics of finding a Levenshtein distance in Python rather than the math that makes it possible. There are many resources on YouTube which explain how the Levenshtein distance is calculated.


First, import NumPy and define a function. I called the function ld as shorthand for Levenshtein distance. The function takes two input strings and creates a 2D matrix with one more row and column than the lengths of the two strings.

import numpy as np

def ld(s1, s2):
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])

If you were to use the strings “pear” and “peach” in this instance, the function should create a 5 by 6 matrix filled with zeros.

A matrix of zeros.

Next, the first row and column need to count up from zero. Using for loops, we can iterate over the selected values. Our Python function now creates the following matrix.

def ld(s1, s2):
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])
    
    for i in range(1, rows):
        dist[i][0] = i
    for i in range(1, cols):
        dist[0][i] = i
A matrix set up for finding the Levenshtein distance.

Finally, we need to iterate over every row and column combination. For each cell, we take the minimum of the cell directly above plus 1, the cell to the left plus 1, and the cell diagonally above-left plus a substitution cost (0 if the corresponding characters match, 1 otherwise), and store that value in the cell in question.

import numpy as np

def ld(s1, s2):
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])
    
    for i in range(1, rows):
        dist[i][0] = i
    for i in range(1, cols):
        dist[0][i] = i
        
    for col in range(1, cols):
        for row in range(1, rows):
            if s1[row-1] == s2[col-1]:
                cost = 0
            else:
                cost = 1
            dist[row][col] = min(dist[row-1][col] + 1,
                                 dist[row][col-1] + 1,
                                 dist[row-1][col-1] + cost)

    # Return only after the entire matrix has been filled in
    return dist[-1][-1]

Our matrix should now look like the following, with the far bottom right cell representing the number of changes required to convert one string into another. In this instance, it requires 2 changes to convert “peach” into “pear”: deleting the letter “c” in “peach” and replacing the letter “h” with the letter “r”.

A completed Levenshtein distance matrix. The bottom right number (in gold) represents the number of changes required.
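As a quick sanity check, calling the finished function on the two example strings returns that same count:

print(ld("pear", "peach"))  # prints 2.0: delete "c", swap "h" for "r"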

What is great about this function is that it accepts strings of any length and computes the number of changes required. While the mechanics behind it are relatively simple, its use cases are vast.

Writing a Machine Learning Classifier: K-Nearest Neighbors

Machine learning is a subset of Artificial Intelligence. Despite the lofty title, machine learning simply refers to developing methods for computers to learn from new information and make predictions.

Imagine you wish to write a program that can identify the species of a flower based on a few measurements. You could write a series of if-statements that guide a computer to classify the flower’s species. However, there are two key issues with this approach.

First, if a human explicitly writes the rules by which the computer classifies the flower’s species, bias is likely to creep in because no human can fully account for all of the data required to classify a flower. Second, if new data is introduced, the rules (if-statements) must be rewritten, which takes valuable time.

For these reasons, a new solution is needed. This solution must adapt to new data, require no explicit writing of rules, and be computationally efficient. In other words, we need a program that can learn.

K-Nearest Neighbors

When discussing machine learning, there is a myriad of methods and models to choose from. Some of these models blur the lines of classical statistics including forms of regression while others replicate the structure of the human brain using neurons.

To solve our classification problem, we will be using a model titled K-Nearest Neighbors. This model, as the name suggests, relies on the assumption that when a new data point is added, it is likely to be of the same type as its nearest already-classified neighbor.

A visual representation of a K-Nearest Neighbor Classifier

In the example above, the x-axis denotes a flower’s petal width while the y-axis denotes the petal’s length. You can see that blue flowers have smaller petal lengths than red flowers but larger petal widths (and vice versa). Let’s say you add a new point (shown in yellow). What type of flower is the yellow point? Red or blue?

According to the model, it is a red flower. This is because it is physically closest to a data point that is already classified as red.

Writing a KNN Classifier

Using Python 3 and Jupyter Notebooks, I have written my own KNN Classifier. While pre-made KNN classifiers are widely available, writing your own provides greater knowledge of how the model works. I begin by importing the necessary Python packages for this program.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%xmode Minimal
%matplotlib inline

Once the packages are imported, I load a famous machine learning dataset called Iris. It includes four columns of measurements for 150 iris flowers, which the model learns from, and a fifth column containing each flower’s classification.

path = 'iris.csv'
testing_size = .2
data = pd.read_csv(path, header = None)

From here, I encode the classifications, which are stored as text in the Pandas DataFrame. In the following code, I create a dictionary mapping each unique class to a number and apply that mapping to the DataFrame column.

classes = pd.Series(data.iloc[:,-1].unique())
i = 1
dictionary = {}
for item in classes:
    dictionary.update( {item : i} )
    i = i + 1
data = data.replace(dictionary)

Next, I convert the DataFrame into a NumPy array and randomly shuffle the array. I also slice the resulting array into 80% training data (for the model to learn) and 20% testing data (to test how accurate the model is).

array = np.array(data)
np.random.shuffle(array)
test_num = round((np.size(array,0))*testing_size)
test = array[:test_num,:]
train = array[test_num:,:]  # start at test_num so no row is skipped

After preparing the data, I use NumPy array broadcasting to determine the distance of each training data point from each testing data point. I then use NumPy functions to locate the index of the closest training point for each testing point and capture its classification as the prediction.

input_array = test[:,:-1]
x = []
for row in input_array:
    distances = (((row - train[:,:-1])**2)**.5).sum(axis=1)
    min_index = np.argmin(distances)
    classification = train[min_index,-1]
    x.append(classification)
predict = np.array(x)[:,np.newaxis]

Finally, I combine the predicted classification with the actual classification of the testing data to determine the accuracy of the model.

output = np.hstack((test,predict))
correct = np.count_nonzero(output[:,-1] == output[:,-2])
total = np.size(output[:,-1])
accuracy = round(((correct/total)*100),2)

The classifier takes (on average) about 1 millisecond to run through the data, and the model is consistently at least 90% accurate (varying with how the NumPy array was shuffled).

While the model developed here is not nearly as optimized as Scikit-Learn (especially for larger datasets), it does provide insight as to how a relatively simple classification model is developed and tested. More importantly, it reveals that machine learning, while very clever, is nothing more than mathematics.

Visualizing Mortgage Loan Originations in the United States

Every year, the Federal National Mortgage Association, commonly known as Fannie Mae (traded as FNMA), provides financial backing to thousands of mortgage lenders across the United States.

In an effort to promote home ownership, Fannie Mae expands the secondary mortgage market by purchasing mortgage loans from lenders and packaging them into Mortgage Backed Securities (MBSs).

This process allows mortgage lenders to reinvest their assets and originate more mortgages. This effectively increases mortgage lending capacity in the United States by reducing reliance on lender reserves.

Fannie Mae publishes aggregated data on the mortgage loans they purchase using a tool called Data Dynamics.

Using Data Dynamics and Tableau, I located and visualized data on single-family home mortgages originated to serve as a primary residence.

This first visualization shows the count of mortgage loans purchased by Fannie Mae broken down by credit score.

Credit Range Color Key

This key applies to all subsequent data visualizations.

In the early 2000s, a wide variety of credit score ranges were represented across millions of mortgage loans. Leading up to 2008, the total number of loan purchases decreased.

Following the Great Recession, lending to low credit score individuals decreased with a slow increase to the 620-660 credit range starting in 2010.

Mortgage loan count by year, broken down by credit score.

I wanted to take a closer look at the percentage of mortgage loans from each credit score range regardless of the count.

This highlights the diverse credit score ranges that were accepted in the early 2000s, tightening percentages on low scores near 2008, and a steady increase of credit score diversity into 2016.

Percent share of mortgage loans by credit score.

Using Loan to Value (LTV) data from Data Dynamics, I was able to calculate an estimated average down payment percentage on mortgages Fannie Mae purchased for each year.

Estimated Down Payment % = (1 – LTV) * 100
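For example, a loan-to-value ratio of 0.77 implies an estimated down payment of (1 − 0.77) × 100 = 23%.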

The average down payment, according to Data Dynamics, steadily increased across all credit ranges until 2006. From 2006 to 2008, the average down payment fell from 23% to 21%.

During the recession, down payments increased to 25% and have been declining since 2010.

The average down payment on Fannie Mae mortgages for each year.

Data Dynamics also provides data on borrowers’ Debt-to-Income ratios (DTIs). Using a box and whisker plot, I was able to visualize the average DTIs for each credit score range.

As expected, borrowers with lower credit scores (0-620 and 620-660) had the highest DTIs. Borrowers with the highest credit scores (780+) always had the lowest DTIs.

A box and whisker plot that shows yearly DTI trends for each credit score.

Every day, Fannie Mae helps Americans achieve the dream of homeownership. Through innovative financing solutions, Americans are able to build equity in their homes and live more enriched lives.

Tableau: Visualizing Mortgages and Education

What is Tableau?

From business intelligence to academic research, Tableau is a leader in the world of data – and rightfully so.

In recent years, Tableau has released a free version of its software called Tableau Public. This free version provides the same wonderful visualization tools as the paid variants, with a few key drawbacks.

  1. You can only connect to a limited set of data sources (no SQL databases)
  2. You cannot save Tableau projects locally
  3. Your projects must be published on Tableau’s website

Despite these drawbacks, Tableau Public is a great way to start using Tableau software.

In this example, I am using Tableau Public to visualize the percentage of homes with a mortgage and the percentage of people with a bachelor’s degree in the state of Texas.

All visualizations draw data from the 2017 American Community Survey.

Visualize the Data

Firstly, let’s look at the percentage of homes with a mortgage across census tracts in Texas.

Percentage of Homes with a Mortgage


The percentage of homes with a mortgage in Texas census tracts

Areas with high percentages of homes with a mortgage are concentrated in specific areas such as north Dallas-Ft. Worth, Austin, and west Houston.

Now let’s look at a similar map that highlights the percentage of people with a bachelor’s degree.

Percentage of People with a Bachelor’s Degree

The percentage of people with a bachelor’s degree in Texas census tracts.

With few exceptions, the two maps display similar concentrations around metropolitan areas.

Finally, let’s look at a scatter plot that compares the two variables in one visualization.

Percent with Mortgage vs. Percent with Bachelor’s

A scatter plot that compares mortgages and bachelor’s degrees in Texas.

The regression line in this scatter plot has an R-squared value of .57, which corresponds to a correlation of roughly 0.75. As a result, the two variables are moderately correlated.

Overall, this simple exploratory data analysis project just scratches the surface of Tableau’s capabilities.