NAHB: Data Staging in Python

The National Association of Home Builders (NAHB) is an organization that represents housing interests within the United States. Its work covers affordable housing, home building techniques and methods, and the promotion of home ownership.

In conjunction with these goals and functions, NAHB releases data about housing in the United States. The data tracks various metrics including Median Income, Interest Rates, and NAHB’s very own Housing Opportunity Index or HOI.

NAHB calculates the HOI by comparing home prices to income in a given area. For example, if income rises in Dallas-Ft. Worth and home prices remain the same, then HOI increases. Alternatively, if income remains the same while home prices rise, HOI falls.

I wanted to visualize this dataset in Tableau. However, when I opened the spreadsheet, I found it was in a format that is incompatible with Tableau.

The raw NAHB spreadsheet.

While the format is acceptable for basic spreadsheet analysis, it lacks the proper long-form layout required for analysis in Tableau. Using Python, I wanted to convert this spreadsheet into a CSV with the following features:

  • Variable name column headers (Median Income, FIP, etc.)
  • One row per quarter per metropolitan statistical area
  • Proper datetime and numerical formats

My Python script begins by importing the proper dependencies. In this instance, I need pandas, numpy, and datetime.

import pandas as pd
import numpy as np
import datetime as dt

Next, I use pandas to read the Excel file, remove unneeded rows, and melt the pivoted table into long form. From there, I rename two columns for better readability.

df = pd.read_excel('housingdata.xls')
df = df.loc[df.NAME.notnull()]

df1 = df[~df['flag'].isin([1,8])]

dfmelt = pd.melt(df1, id_vars=['msa_fip','NAME','flag'])

dfmelt.rename(columns={'variable':'date',
                       'NAME':'variable'},
              inplace=True)
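
To make the melt step concrete, here is a minimal sketch on a made-up table. The quarter labels (q1_91, q2_91), the FIP code, the flag values, and the numbers are all assumptions for illustration and are not taken from the NAHB file.

```python
import pandas as pd

# Hypothetical wide table: one row per metric, one column per quarter.
wide = pd.DataFrame({
    'msa_fip': [10180, 10180],
    'NAME': ['Median Income', 'HOI'],
    'flag': [2, 3],
    'q1_91': [30.0, 70.0],
    'q2_91': [31.0, 68.0],
})

# melt keeps the id columns and stacks every other column into two new
# columns: 'variable' (the old column header) and 'value' (the cell).
long = pd.melt(wide, id_vars=['msa_fip', 'NAME', 'flag'])

# Same renaming as in the script: the old headers are dates, and the
# NAME column holds the metric names.
long = long.rename(columns={'variable': 'date', 'NAME': 'variable'})

print(long)
```

Two metrics across two quarters become four rows, one per metric per quarter.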

I use the melted dataset to create a new column called index, which concatenates the FIP code and the date label into a unique key for each area and quarter. Pivoting on that key produces a layout that Tableau can read.

dfmelt['msa_fip'] = dfmelt['msa_fip'].apply(str)

dfmelt['index'] = dfmelt['msa_fip'] + dfmelt['date']

dfpivot = dfmelt.pivot(index='index', 
                       columns='variable', 
                       values='value').reset_index()
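
The effect of the composite key can be seen on a small example. This is only a sketch: the FIP code, date labels, and values are invented, and the real file may use different labels.

```python
import pandas as pd

# Hypothetical long table, shaped like the output of the melt step.
long = pd.DataFrame({
    'msa_fip': ['10180'] * 4,
    'variable': ['Median Income', 'HOI', 'Median Income', 'HOI'],
    'date': ['q1_91', 'q1_91', 'q2_91', 'q2_91'],
    'value': [30.0, 70.0, 31.0, 68.0],
})

# One key per area and quarter, e.g. '10180q1_91'.
long['index'] = long['msa_fip'] + long['date']

# Each distinct metric becomes its own column; each key becomes one row.
wide = long.pivot(index='index',
                  columns='variable',
                  values='value').reset_index()

print(wide)
```

pivot raises an error if the same key and metric pair appears more than once, so the key has to identify each area and quarter uniquely.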

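Next, I slice the index column into separate FIP, quarter, and year columns. The year is stored as two digits, so I treat values above 80 as 19xx and everything else as 20xx, then combine the year and quarter into a proper datetime column.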

dfpivot['FIP'] = dfpivot['index'].str[:5]
dfpivot['Quarter'] = dfpivot['index'].str[6:7]
dfpivot['Year'] = dfpivot['index'].str[8:]

dfpivot['Year'] = dfpivot['Year'].apply(int)

dfpivot['Year'] = np.where(dfpivot['Year']>80,
                           1900 + dfpivot['Year'],
                           2000 + dfpivot['Year'])

dfpivot['Year'] = dfpivot['Year'].apply(str)

# Map each quarter number to the first day of that calendar quarter.
dfpivot['Quarter'] = dfpivot['Quarter'].replace({'1':'01-01',
                                                 '2':'04-01',
                                                 '3':'07-01',
                                                 '4':'10-01'})

dfpivot['Date'] = (dfpivot['Year'] + '-' + dfpivot['Quarter'])

dfpivot['Date'] = pd.to_datetime(dfpivot['Date'])
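
The same date logic can be traced on a single key. The sketch below assumes a key made of a five-digit FIP code followed by a label such as q1_91, and it assumes each quarter should map to its first calendar day (January, April, July, October). The helper quarter_start is hypothetical and not part of the original script.

```python
import pandas as pd

def quarter_start(quarter: str, two_digit_year: int) -> pd.Timestamp:
    """Return the first day of a quarter from a quarter digit and a 2-digit year."""
    # Same cutoff as the script: values above 80 are read as 19xx, the rest as 20xx.
    year = 1900 + two_digit_year if two_digit_year > 80 else 2000 + two_digit_year
    # Quarters 1 to 4 start in months 1, 4, 7, and 10.
    month = 3 * (int(quarter) - 1) + 1
    return pd.Timestamp(year=year, month=month, day=1)

key = '10180q1_91'  # hypothetical key: FIP code + quarter label

# The same slices the script applies to the whole column.
fip, quarter, year = key[:5], key[6:7], int(key[8:])

first_day = quarter_start(quarter, year)
print(fip, first_day.date())
```

The fixed slice positions only hold if every FIP code has exactly five characters, so it is worth checking the key lengths before relying on them.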

Lastly, I collect the unique names of all metropolitan statistical areas from the initial dataframe. I then left join this smaller lookup table, called names, to the newly formatted dataset so that every row carries its area name.

names = df[['NAME','msa_fip']].loc[df.flag == 1].drop_duplicates()
names['msa_fip'] = names['msa_fip'].apply(str)
dfpivot['FIP'] = dfpivot['FIP'].apply(str)
names.set_index('msa_fip', inplace=True)
dfpivot.set_index('FIP', inplace=True)

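output = dfpivot.join(names, how='left')

# Name the index explicitly so reset_index() restores it as the FIP column,
# then drop the helper key column, which is no longer needed.
output.index.name = 'FIP'
output = output.reset_index().drop(columns='index')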
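
A left join keeps every row of the staged data, even when a FIP code has no match in the lookup table. The sketch below uses invented codes and names to show that behavior and a quick way to spot unmatched codes.

```python
import pandas as pd

# Hypothetical staged data, indexed by FIP code.
data = pd.DataFrame({
    'FIP': ['10180', '10180', '99999'],
    'HOI': [70.0, 68.0, 55.0],
}).set_index('FIP')

# Hypothetical lookup table with one area name per FIP code.
names = pd.DataFrame({
    'msa_fip': ['10180'],
    'NAME': ['Area A'],
}).set_index('msa_fip')

# how='left' keeps all three data rows; '99999' has no name, so it gets NaN.
joined = data.join(names, how='left')

# Codes that did not find a name in the lookup table.
unmatched = joined.index[joined['NAME'].isna()].unique().tolist()
print(unmatched)
```

An empty unmatched list after the real join would suggest that every area in the staged data received a name.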

This final dataframe can now be saved as a CSV or Excel file for further analysis in Tableau.
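
As a rough sketch of the save step, the example below writes a small stand-in dataframe to CSV and reads it back. The file name nahb_staged.csv and the sample rows are assumptions. The file goes to the system temp directory only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the staged dataframe (values are made up).
output = pd.DataFrame({
    'FIP': ['10180', '10180'],
    'Date': pd.to_datetime(['1991-01-01', '1991-04-01']),
    'HOI': [70.0, 68.0],
})

# index=False leaves the row numbers out of the file.
csv_path = Path(tempfile.gettempdir()) / 'nahb_staged.csv'
output.to_csv(csv_path, index=False)

# Reading it back: keep FIP as text and parse Date as a datetime.
check = pd.read_csv(csv_path, dtype={'FIP': str}, parse_dates=['Date'])
print(check.dtypes)
```

Keeping FIP as text avoids it being treated as a number that could be summed or averaged by accident.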

The NAHB data after staging it in Python.

Now that the data has been staged and saved as a CSV, we can conduct deeper analysis. Using Tableau Public, I created two visualizations about the Housing Opportunity Index.

The first visualization shows how the index, averaged across all metropolitan statistical areas, has changed over time.

A time series plot of the HOI over time.
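
The averaging behind a chart like this can also be checked in pandas before opening Tableau. This is a minimal sketch on invented rows, assuming the staged data has Date and HOI columns. It takes a simple unweighted mean across areas, which may differ from how Tableau was configured.

```python
import pandas as pd

# Hypothetical staged rows: two areas, two quarters.
staged = pd.DataFrame({
    'FIP': ['10180', '10420', '10180', '10420'],
    'Date': pd.to_datetime(['1991-01-01', '1991-01-01',
                            '1991-04-01', '1991-04-01']),
    'HOI': [70.0, 60.0, 68.0, 64.0],
})

# Unweighted mean HOI across all areas for each quarter.
avg_hoi = staged.groupby('Date')['HOI'].mean()
print(avg_hoi)
```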

The second visualization is a scatterplot that compares the median home price in a metropolitan statistical area to the HOI in that area. As one may suspect, home prices inversely correlate with housing opportunity. In other words, areas with higher median home prices tend to have lower HOI values.

A scatterplot comparing the HOI to median home prices.
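
The direction of that relationship can be checked numerically with a correlation coefficient. The sketch below uses invented numbers and a hypothetical Median Price column name, so it only demonstrates the calculation and says nothing about the real NAHB figures.

```python
import pandas as pd

# Made-up sample: prices rise while HOI falls.
sample = pd.DataFrame({
    'Median Price': [100, 150, 200, 250],
    'HOI': [80.0, 70.0, 55.0, 40.0],
})

# Pearson correlation; a negative value indicates an inverse relationship.
r = sample['Median Price'].corr(sample['HOI'])
print(round(r, 3))
```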