Hi! I’m Chris Walker.

I am an Analytics Associate for Fannie Mae in Washington D.C.

I graduated with honors from Texas A&M University with a BS in Urban and Regional Planning and a minor in Economics. In the past, I have held roles as a Business Analyst Intern, Project Lead Intern, and Data Analytics and Marketing Intern.

With the world creating more data than ever before, it is important that communities and businesses leverage the power of data to improve.

I am passionate about using data to solve problems, develop communities, and advance businesses.

Feel free to contact me on LinkedIn or via email at cjwalker@aggienetwork.com.

Zillow: Taking a Look at the Real Estate Marketing Giant

Zillow provides prospective homebuyers and renters with a toolkit to search for their next residence. Users can filter by price, number of bedrooms/bathrooms, home type, square footage, and lot size, and they gain access to Zillow’s famous Zestimate – all for free. Even better, homeowners, agents, and rental property managers can list residences for free as well.

How is it, with so many free services, that Zillow earns over $1 billion in annual revenue and employs thousands of people? As is the case with many online media platforms, it is primarily through the sale of ads.

Much like ads on Google search pages, Zillow allows advertisers to reach specific audiences through advertisements on their platform. Advertisers range from rental agencies to interior designers – all people who need to reach subsets of prospective renters and buyers.

An ad from an internet service provider in Texas

Zillow maintains other revenue streams too. In addition to ads on Zillow’s own website, Zillow advertisers can run ads on other sites in the Zillow Rental Network. This network of websites includes Trulia, HotPads, and even AOL Real Estate.

Beyond advertising, Zillow has launched Zillow Offers, a program that provides cash offers to homeowners who wish to expedite the sale of their home. Zillow connects sellers with a representative who helps them work through the sale and closing of their home.

Zillow also offers real estate agents the opportunity to become a Premier Agent. Premier Agents pay Zillow for additional services that help them run their agencies. Premier Agents can receive leads from Zillow’s network of sites and from Facebook. They can also gain access to a CRM and even a website for their agency.

Zillow is owned by the Zillow Group, the company that oversees Zillow, Trulia, HotPads, Zillow Home Loans, and a number of other organizations. With such a broad network of websites, services, and users, it’s no wonder that Zillow attracts people from all over the housing industry.

NAHB: Data Staging in Python

The National Association of Home Builders (NAHB) is an organization that represents housing interests within the United States. These include affordable housing, home building techniques and methods, and the promotion of home ownership.

In conjunction with these goals and functions, NAHB releases data about housing in the United States. The data tracks various metrics including Median Income, Interest Rates, and NAHB’s very own Housing Opportunity Index (HOI).

NAHB calculates the HOI by comparing home prices to incomes in a given area. For example, if incomes rise in Dallas-Ft. Worth and home prices remain the same, the HOI increases. Alternatively, if incomes remain the same while home prices rise, the HOI falls.
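
As a toy illustration of that relationship (this is not NAHB’s actual HOI methodology, which is more involved), a simple income-to-price ratio moves in the same direction:

# Toy affordability ratio, for intuition only (not NAHB's HOI formula).
def affordability(median_income, median_price):
    return median_income / median_price

print(affordability(70_000, 280_000))   # 0.25  (baseline)
print(affordability(77_000, 280_000))   # 0.275 (income rises -> ratio rises)
print(affordability(70_000, 308_000))   # ~0.227 (prices rise -> ratio falls)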

I wanted to visualize this dataset in Tableau; however, when I opened the spreadsheet, it was in a format that is incompatible with Tableau.

The raw NAHB spreadsheet.

While the format is acceptable for basic spreadsheet analysis, it lacks the proper long-form layout required for analysis in Tableau. Using Python, I wanted to convert this spreadsheet into a CSV with the following features:

  • Variable name column headers (Median Income, FIP, etc.)
  • One row per quarter per metropolitan statistical area
  • Proper datetime and numerical formats

My Python script begins by importing the proper dependencies. In this instance, I need pandas, numpy, and datetime.

import pandas as pd
import numpy as np
import datetime as dt

Next, I use Pandas to read the Excel file, remove unneeded rows, and melt the pivoted table format. From there, I rename two columns for better readability.

# Read the raw NAHB spreadsheet and drop rows with no name value.
df = pd.read_excel('housingdata.xls')
df = df.loc[df.NAME.notnull()]

# Keep only rows whose flag is not 1 or 8.
df1 = df[~df['flag'].isin([1,8])]

# Melt the wide, pivoted layout into long form.
dfmelt = pd.melt(df1, id_vars=['msa_fip','NAME','flag'])

# Rename columns for readability.
dfmelt.rename(columns={'variable':'date',
                       'NAME':'variable'},
              inplace=True)

I use the melted dataset to create a new column called index, which is used to pivot the data into a format that Tableau can read.

dfmelt['msa_fip'] = dfmelt['msa_fip'].apply(str)

dfmelt['index'] = dfmelt['msa_fip'] + dfmelt['date']

dfpivot = dfmelt.pivot(index='index', 
                       columns='variable', 
                       values='value').reset_index()

Next, I slice the existing columns into new Year and Quarter columns and combine them into a proper datetime column.

# Slice the concatenated index into FIP code, quarter, and two-digit year.
dfpivot['FIP'] = dfpivot['index'].str[:5]
dfpivot['Quarter'] = dfpivot['index'].str[6:7]
dfpivot['Year'] = dfpivot['index'].str[8:]

dfpivot['Year'] = dfpivot['Year'].apply(int)

# Two-digit years above 80 belong to the 1900s; the rest to the 2000s.
dfpivot['Year'] = np.where(dfpivot['Year']>80,
                           1900 + dfpivot['Year'],
                           2000 + dfpivot['Year'])

dfpivot['Year'] = dfpivot['Year'].apply(str)

# Map each quarter to a representative month and day.
dfpivot['Quarter'] = dfpivot['Quarter'].replace({'1':'01-01',
                                                 '2':'03-01',
                                                 '3':'06-01',
                                                 '4':'09-01'})

# Combine year and quarter into a single datetime column.
dfpivot['Date'] = (dfpivot['Year'] + '-' + dfpivot['Quarter'])

dfpivot['Date'] = pd.to_datetime(dfpivot['Date'])

Lastly, I collect the unique names of all metropolitan statistical areas from the initial dataframe. I then left join this new, smaller dataset, called names, to the newly formatted dataset.

names = df[['NAME','msa_fip']].loc[df.flag == 1].drop_duplicates()
names['msa_fip'] = names['msa_fip'].apply(str)
dfpivot['FIP'] = dfpivot['FIP'].apply(str)
names.set_index('msa_fip', inplace=True)
dfpivot.set_index('FIP', inplace=True)

output = (dfpivot.join(names,
                       how='left')).reset_index()

output['FIP'] = output['index']
output.drop(columns='index', inplace=True)

This final dataframe can now be saved as a CSV or Excel file for further analysis in Tableau.
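
Saving the result is a one-liner in pandas; the file name here is just a placeholder:

# Write the staged data to a CSV that Tableau can read (file name is illustrative).
output.to_csv('nahb_staged.csv', index=False)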

The NAHB data after staging it in Python.

Now that the data has been staged and saved as a CSV, we can conduct deeper analysis. Using Tableau Public, I created two visualizations about the Housing Opportunity Index.

The first visualization highlights the average change in the index over time across all metropolitan statistical areas in the country.

A time series plot of the HOI over time.

The second visualization is a scatterplot that compares the median home price in a metropolitan statistical area to the HOI in that area. As one might suspect, home prices inversely correlate with housing opportunity: the lower the home prices, the greater the housing opportunity.

A scatterplot comparing the HOI to median home prices.

Implementing Fuzzy Matching in Python

Text is all around us: essays, articles, legal documents, text messages, and news headlines are consistently present in our daily lives. This abundance of text provides ample opportunities to analyze unstructured data.


Imagine you are playing a game where someone hands you an index card with the misspelled name of a popular musician. In addition, you have a book containing the correctly spelled names of popular musicians. The goal is for you to return the correct spelling of the misspelled name.

In this example, suppose someone hands you a card with “Billie Jole” written on it. You quickly open the book of musicians, find names beginning with the letter B, and find the name “Billy Joel.”

As a human, this was easy for you to complete, but what if you wanted to automate this task? This can be done using Fuzzy Logic, or more specifically, the Levenshtein distance.


The Levenshtein distance considers two pieces of text and determines the minimum number of changes required to convert one string into another. You can utilize this logic to find the closest match to any given piece of text.

I am going to focus on implementing the mechanics of finding a Levenshtein distance in Python rather than the math that makes it possible. There are many resources on YouTube which explain how the Levenshtein distance is calculated.


First, import numpy and define a function. I called the function ld as shorthand for Levenshtein distance. The function takes two input strings, which are used to create a 2D matrix whose dimensions are each one greater than the length of the corresponding string.

import numpy as np

def ld(s1, s2):
    # One extra row and column account for the empty-string prefixes.
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])

If you were to use the strings “pear” and “peach” in this instance, the function should create a 5 by 6 matrix filled with zeros.

A matrix of zeros.

Next, the first row and column need to count up from zero. Using for loops, we can iterate over the selected values. Our Python function now creates the following matrix.

def ld(s1, s2):
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])
    
    for i in range(1, rows):
        dist[i][0] = i
    for i in range(1, cols):
        dist[0][i] = i
A matrix set up for finding the Levenshtein distance.

Finally, we need to iterate over every row and column combination. For each cell, the script considers three candidate values: the cell directly above plus one (a deletion), the cell to the left plus one (an insertion), and the cell diagonally above and to the left plus one if the corresponding characters differ (a substitution). The smallest of these candidates becomes the value of the cell in question.

def ld(s1, s2):
    rows = len(s1)+1
    cols = len(s2)+1
    dist = np.zeros([rows,cols])

    # The first column counts deletions; the first row counts insertions.
    for i in range(1, rows):
        dist[i][0] = i
    for i in range(1, cols):
        dist[0][i] = i

    for col in range(1, cols):
        for row in range(1, rows):
            # A substitution costs nothing when the characters already match.
            if s1[row-1] == s2[col-1]:
                cost = 0
            else:
                cost = 1
            dist[row][col] = min(dist[row-1][col] + 1,      # deletion
                                 dist[row][col-1] + 1,      # insertion
                                 dist[row-1][col-1] + cost) # substitution
    # The bottom-right cell holds the distance between the full strings.
    return dist[-1][-1]

Our matrix should now look like the following, with the far bottom-right cell representing the number of changes required to convert one string into the other. In this instance, it takes 2 changes to convert “peach” into “pear”: deleting the letter “c” in “peach” and replacing the letter “h” with the letter “r”.

A completed Levenshtein distance matrix. The bottom right number (in gold) represents the number of changes required.

What is so great about this function is that it is adaptable: it will accept strings of any length and compute the number of changes required. While the mechanics behind this function are relatively simple, its use cases are vast.
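
To close the loop on the musician example, here is a sketch of how ld could drive a closest-match lookup; the list below is just a stand-in for the “book” of correctly spelled names:

# A hypothetical "book" of correctly spelled musician names.
musicians = ['Billy Joel', 'Billie Eilish', 'Bill Withers', 'Bob Dylan']

def closest_match(query, candidates):
    # Return the candidate with the smallest Levenshtein distance to the query.
    return min(candidates, key=lambda name: ld(query, name))

print(closest_match('Billie Jole', musicians))  # Billy Joel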

Writing a Machine Learning Classifier: K-Nearest Neighbors

Machine learning is a subset of Artificial Intelligence. Despite the elusive title, machine learning simply refers to developing methods for computers to learn from new information and make predictions.

Imagine you wish to write a program that can identify the species of a flower based on a few measurements. You could write a series of if-statements that guide a computer to classify the flower’s species. However, there are two key issues with this approach.

First, if a human explicitly writes the rules by which the computer classifies the flower’s species, bias is likely to creep in because no human can fully account for all of the data required to classify a flower. Second, if new data is introduced, the rules (if-statements) must be rewritten, which takes valuable time.

For these reasons, a new solution is needed. This solution must adapt to new data, require no explicit writing of rules, and be computationally efficient. In other words, we need a program that can learn.

K-Nearest Neighbors

When discussing machine learning, there are myriad methods and models to choose from. Some of these models blur the line with classical statistics (including forms of regression), while others replicate the structure of the human brain using neurons.

To solve our classification problem, we will be using a model called K-Nearest Neighbors. This model, as the name suggests, relies on the assumption that a new data point added to the model is likely to be of the same type as its nearest already-classified neighbor.

A visual representation of a K-Nearest Neighbor Classifier

In the example above, the x-axis denotes a flower’s petal width while the y-axis denotes the petal’s length. You can see that blue flowers have smaller petal lengths than red flowers but larger petal widths (and vice versa). Let’s say you add a new point (shown in yellow). What type of flower is the yellow point? Red or blue?

According to the model, it is a red flower. This is because it is physically closest to a data point that is already classified as red.

Writing a KNN Classifier

Using Python 3 and Jupyter Notebooks, I have written my own KNN Classifier. While pre-made KNN classifiers are widely available, writing your own provides greater knowledge of how the model works. I begin by importing the necessary Python packages for this program.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%xmode Minimal
%matplotlib inline

Once the packages are imported, I load a famous machine learning dataset called Iris. It includes four columns of measurements for 150 iris flowers, which the model learns from, and a fifth column that contains each flower’s classification.

path = 'iris.csv'
testing_size = .2
data = pd.read_csv(path, header = None)

From here, I determine the classifications, which are stored as text in the Pandas DataFrame. In the following code, I create a dictionary that maps each unique value to a number and apply that mapping to the DataFrame column.

# Map each unique class label to an integer code, then apply the mapping.
classes = pd.Series(data.iloc[:,-1].unique())
i = 1
dictionary = {}
for item in classes:
    dictionary.update( {item : i} )
    i = i + 1
data = data.replace(dictionary)

Next, I convert the DataFrame into a NumPy array and randomly shuffle the array. I also slice the resulting array into 80% training data (for the model to learn) and 20% testing data (to test how accurate the model is).

# Shuffle the data, then split it into testing and training sets.
array = np.array(data)
np.random.shuffle(array)
test_num = round((np.size(array,0))*testing_size)
test = array[:test_num,:]
train = array[test_num:,:]

After preparing the data, I use NumPy array broadcasting to compute the distance from each testing data point to every training data point. I then use NumPy functions to locate the index of the closest training point for each testing point and capture its classification as the prediction.

input_array = test[:,:-1]
x = []
for row in input_array:
    # Distance from this testing point to every training point.
    distances = (((row - train[:,:-1])**2)**.5).sum(axis=1)
    # Take the classification of the nearest training point.
    min_index = np.argmin(distances)
    classification = train[min_index,-1]
    x.append(classification)

# Stack the predictions into a column vector.
predict = np.array(x)[:,np.newaxis]

Finally, I combine the predicted classification with the actual classification of the testing data to determine the accuracy of the model.

output = np.hstack((test,predict))
correct = np.count_nonzero(output[:,-1] == output[:,-2])
total = np.size(output[:,-1])
accuracy = round(((correct/total)*100),2)

The classifier takes (on average) about 1 millisecond to run through the data, and the model is always at least 90% accurate (the exact figure depends on how the NumPy array was shuffled).
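
To reuse the same logic on a single new observation, a small helper along these lines could work; the measurements below are illustrative values in the range of the Iris features, not a claim about any particular flower:

def predict_one(measurements, train):
    # Nearest-neighbor prediction for one observation, using the same
    # distance calculation as the loop above.
    distances = (((measurements - train[:,:-1])**2)**.5).sum(axis=1)
    return train[np.argmin(distances), -1]

# Hypothetical sepal and petal measurements for a new flower.
new_flower = np.array([5.0, 3.4, 1.5, 0.2])
print(predict_one(new_flower, train))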

While the model developed here is not nearly as optimized as Scikit-Learn’s implementation (especially for larger datasets), it does provide insight into how a relatively simple classification model is developed and tested. More importantly, it reveals that machine learning, while very clever, is nothing more than mathematics.

Visualizing Mortgage Loan Originations in the United States

Every year, the Federal National Mortgage Association, commonly known as Fannie Mae (traded as FNMA), provides financial backing to thousands of mortgage lenders across the United States.

In an effort to promote home ownership, Fannie Mae expands the secondary mortgage market by purchasing mortgage loans from lenders and packaging them into Mortgage Backed Securities (MBSs).

This process allows mortgage lenders to reinvest their assets and originate more mortgages. It effectively increases the supply of mortgage credit in the United States by reducing reliance on lender reserves.

Fannie Mae publishes aggregated data on the mortgage loans they purchase using a tool called Data Dynamics.

Using Data Dynamics and Tableau, I located and visualized data on single-family home mortgages originated to serve as a primary residence.

This first visualization shows the count of mortgage loans purchased by Fannie Mae broken down by credit score.

Credit Range Color Key

This key applies to all subsequent data visualizations.

In the early 2000s, a wide variety of credit score ranges were represented across millions of mortgage loans. Leading up to 2008, the total number of loan purchases decreased.

Following the Great Recession, lending to low credit score individuals decreased with a slow increase to the 620-660 credit range starting in 2010.

Mortgage loan count by year, broken down by credit score.

I wanted to take a closer look at the percentage of mortgage loans from each credit score range regardless of the count.

This highlights the diverse credit score ranges that were accepted in the early 2000s, the tightening share of low scores around 2008, and a steady return of credit score diversity into 2016.

Percent share of mortgage loans by credit score.

Using Loan to Value (LTV) data from Data Dynamics, I was able to calculate an estimated average down payment percentage on mortgages Fannie Mae purchased for each year.

Estimated Down Payment % = (1 – LTV) * 100
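
For example, with the LTV expressed as a decimal, a loan at an LTV of 0.79 implies an estimated down payment of (1 – 0.79) * 100 = 21%.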

The average down payment, according to Data Dynamics, steadily increased across all credit ranges until 2006. From 2006 to 2008, the average down payment fell from 23% to 21%.

During the recession, down payments increased to 25% and have been declining since 2010.

The average down payment on Fannie Mae mortgages for each year.

Data Dynamics also provides data on borrowers’ Debt-to-Income Ratios (DTIs). Using a box and whisker plot, I was able to visualize the average DTIs for each credit score range.

As expected, borrowers with lower credit scores (0-620 and 620-660) had the highest DTIs. Borrowers with the highest credit scores (780+) always had the lowest DTIs.

A box and whisker plot that shows yearly DTI trends for each credit score.

Every day, Fannie Mae helps Americans achieve the dream of homeownership. Through innovative financing solutions, Americans are able to build equity in their homes and live more enriched lives.

Tableau: Visualizing Mortgages and Education

What is Tableau?

From business intelligence to academic research, Tableau is a leader in the world of data – and rightfully so.

In recent years, Tableau has released a free version called Tableau Public. This free version provides all of the same wonderful visualization tools as the paid variants, with a few key drawbacks:

  1. You can only connect to a limited set of data sources (no SQL databases)
  2. You cannot save Tableau projects locally
  3. Your projects must be published on Tableau’s website

Despite these drawbacks, Tableau Public is a great way to start using Tableau software.

In this example, I am using Tableau Public to visualize the percentage of homes with a mortgage and the percentage of people with a bachelor’s degree in the state of Texas.

All visualizations draw data from the 2017 American Community Survey.

Visualize the Data

Firstly, let’s look at the percentage of homes with a mortgage across census tracts in Texas.

Percentage of Homes with a Mortgage


The percentage of homes with a mortgage in Texas census tracts

Areas with high percentages of homes with a mortgage are concentrated in specific areas such as north Dallas-Ft. Worth, Austin, and west Houston.

Now let’s look at a similar map that highlights the percentage of people with a bachelor’s degree.

Percentage of People with a Bachelor’s Degree

The percentage of people with a bachelor’s degree in Texas census tracts.

With few exceptions, the two maps display similar concentrations around metropolitan areas.

Finally, let’s look at a scatter plot that compares the two variables in one visualization.

Percent with Mortgage vs. Percent with Bachelor’s

A scatter plot that compares mortgages and bachelor’s degrees in Texas.

The regression line in this scatter plot has an R-squared value of .57, indicating a moderate correlation between the two variables.
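
As a quick aside, for a simple linear regression the R-squared is just the square of the Pearson correlation, so the same check could be reproduced outside Tableau with a few lines of Python; the arrays below are placeholders, not the actual tract-level values:

import numpy as np

# Placeholder arrays standing in for tract-level percentages.
pct_mortgage = [45.0, 52.3, 61.8, 38.4, 70.1]
pct_bachelors = [22.5, 30.1, 41.7, 18.9, 55.2]

# R-squared of a simple linear regression equals the squared correlation.
r = np.corrcoef(pct_mortgage, pct_bachelors)[0, 1]
print(round(r**2, 2))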

Overall, this simple exploratory data analysis project just scratches the surface of Tableau’s capabilities.

Lean and Mean: 4 Lean Management Techniques That Win

The year was 1913 when Henry Ford developed a fully functioning moving assembly line. Using interchangeable parts, specialized machinery, and diligent hands, Ford was able to reduce the time to produce a car by 10 hours. What Ford lacked, however, was variety.

This is where Toyota, one of the largest adopters of lean principles, steps in. Toyota wanted to produce cars in conjunction with demand without sacrificing quality or variety.

Toyota introduced self-monitoring machines that would notify the previous machine of its material needs. These innovations allowed Toyota to produce on demand at a low cost with high variety and quality.

The principles Toyota implemented following World War II resonated with manufacturers around the world and have been found effective for office teams as well.

Here are 4 lean management techniques that foster a culture of constant improvement within teams:

1. Team Huddles

Having short, regular meetings with your coworkers ensures everyone is in the loop.

Using a huddle board is an effective way to monitor everyone’s projects. A huddle board often displays projects sorted into columns.

Huddle boards can be digital, like Trello or Asana, or physical, like a whiteboard.

An example of a digital huddle board.

Use a decision chart like the one shown below when determining task priorities.

Projects and tasks that are high urgency and require low effort should be completed first while projects that fall in other quadrants can be completed later.

A task decision chart.

If a particular problem arises during a huddle, end the huddle and launch a problem solve.

2. Problem Solve with a Five-Why

Problem solving sessions are held as needed. Toyota utilizes problem solves to address issues with car production as soon as they occur.

One common lean management strategy used in problem solving sessions is called a “five-why”.

As the name suggests, a five-why follows the assumption that most problems can be traced to their root within five “whys.” The following is an example of a five-why:

Using a five-why allows the root of a problem to be uncovered quickly.

3. Feedback Surveys

Effective managers should be gathering feedback from their team on a regular basis.

Asking your team members to answer questions such as “Are all of your needs met?” and “Was your work-life balance sustainable?” helps managers understand the needs of their team.

One good way to collect survey responses is through a Google Form that employees submit weekly. If the form is connected to a Google Sheet, it is simple for managers to gather and analyze employee feedback.

4. Minimum Viable Product (MVP)

For startups and established enterprises alike, focusing on developing a minimum viable product is key. Rather than develop a complete prototype and release it to the public, release incremental updates and gather market feedback along the way.

This ensures that all development of new services and products is in line with public demand.

Keep It Lean

Implementing these lean principles within your projects can provide increased productivity and reduced waste. Lean management can be applied to a variety of disciplines to make work-life better.

National Planning Conference 2018: My Experience

In the fall of 2017, I was presented with the opportunity to complete a major project in lieu of several class assignments.

After personally witnessing the decline of several shopping malls in north Dallas, I wanted to focus on how mixed-use developments can help revitalize areas previously occupied by shopping malls.

I selected Collin Creek Mall in east Plano, Texas, for my project. This mall, built in the early 1980s, has lost foot traffic and shop leases in recent years.

Using the existing property lines, I designed a mixed-use development that includes commercial and residential spaces. The following is an aerial view of my design:

An aerial view of the proposed development.

At the end of the semester, I presented the design to my classmates. It was around this time that my professor suggested that I submit my project to the 2018 National Planning Conference.

A closer look at the River Side Community.

I submitted the project and was accepted as the only undergraduate student from Texas A&M to present at the conference in New Orleans! The school also agreed to reimburse me for my conference expenses.

By then it was February, and I had until April 2018 to prepare my presentation. By April, my poster was printed and ready for the conference.

When I arrived at the conference, vendors and other speakers began to set up their booths. I found a spot for my poster among graduate students from Texas A&M and other schools.

Presenting my poster at NPC18.

I presented my poster to economists, urban planners, data scientists, and real estate professionals throughout the day. I spent my time answering questions about why I chose Collin Creek Mall, the software I used to create the models, and different aspects of the design.

Overall, presenting at NPC18 was a great experience. I had the opportunity to speak with well established professionals and share my work over the past two semesters, an invaluable exercise.

Python for Data Science: The 5 Things You Need to Get Started

Python is a general-purpose programming language that was originally released in the early 1990s. Over the years, it has become famous for being easy to read and learn.

Like many modern programming languages, Python is open source which means it can be downloaded and used for free.

While Python is useful on its own, developers have created many packages that can be added to Python to extend its functionality even further.

It is also worth noting that Python has been released in three major versions. While Python 2 still has many active users, Python 3 is the future of the language.

If you are just getting started with Python, it is best to start with Python 3 so you are learning to work with the latest and greatest Python packages.
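
If you are unsure which version is already installed, a quick check from within Python looks like this:

import sys

# Print the interpreter version to confirm you are running Python 3.
print(sys.version)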

1: Anaconda

The Anaconda distribution of Python by Continuum Analytics is the first tool you need to get started with data science in Python.

Anaconda comes with many of the most popular Python packages in addition to an Integrated Development Environment (IDE) called Jupyter. (All for free!)

Getting started with Anaconda is straightforward. Download the installer from the Anaconda website and run it. Once installed, you will be greeted by Anaconda Navigator’s main screen.

Anaconda Navigator’s main screen.

From here you can launch applications and manage your Python packages.

2: Jupyter Notebooks

Being comfortable with Jupyter Notebooks is key for any aspiring data scientist.

Jupyter is famous for allowing Python developers to code in an interactive computing environment. Simply put, you can execute code as you write it.

Code in Jupyter notebooks is executed in cells. Open your own Jupyter notebook and type the following code in the first cell:

print("Jupyter is Great")

Once you write your code, press Shift + Enter. Your code will be executed below the cell.

Executing code from the first cell.

In the second cell, enter new code and press Shift + Enter.

Code in the second cell executed after the first cell.

Code in the second cell was executed separately from the first cell.

Jupyter is a powerful tool for data science. As you begin to use it more, its benefits will become even more apparent. 

3: Pandas

Pandas is a free Python package that allows developers to import, manipulate, and visualize data in tables called DataFrames. Oftentimes, you can complete work typically done in spreadsheets much faster in Pandas.

If you installed Anaconda, launch a Jupyter notebook to get started with Pandas. To use any Python package, you need to import the package.

In the first cell of your Jupyter notebook, type the following code and press Shift + Enter.

import pandas as pd

Now you are ready to import data into Pandas. I added a .csv file to the same folder that my Jupyter notebook is stored in. Execute the following code to import the data and save it as a variable titled df.

df = pd.read_csv('stocks.csv')

Once your data imports, you can execute the following command to view the first 5 rows of your data.

df.head()
Viewing the head of my data.

This just scratches the surface of what Pandas is capable of.
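
For a taste of what else is possible, a few common operations might look like this (assuming stocks.csv has the date and price columns used in the plotting example in the next section):

# Summary statistics for the numeric columns.
df.describe()

# Filter rows where the price is above 105.
df[df['price'] > 105]

# Sort the data by date.
df.sort_values('date')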

4: Matplotlib

Matplotlib, like Pandas, is another Python package that is free to use. Matplotlib is used for visualizing data – or in other words, making graphs!

Visualizing your findings is the most important part of data science. If you are unable to communicate your findings to others, then your effectiveness as a data scientist is limited.

Matplotlib is already installed with Anaconda. You can import it with the following code:

import matplotlib.pyplot as plt

I am plotting my dataframe (called df) with this code:

df.plot(kind='bar',
        x='date',
        y='price',
        color='green',
        alpha=.25,
        ylim=(100,110),
        figsize=(12,8))

plt.show()

The result should look something like the following:

The bar graph of stock prices.

Matplotlib can customize graphs far beyond this example, which covered basic plotting, colors, transparency (alpha), axis limits, and figure sizing.

5: Data

Data analysis is only as good as the data being used in the analysis. It is important that the data you use in your own work is structured and accessible.

Level of Structure | Definition
Unstructured | Datasets that do not conform to any unified format. Ex. audio files, text, pictures.
Semi-structured | Datasets that do not conform to the formal structure of a database or spreadsheet (rows and columns) but are organized by tags or markers. Ex. a Word document with comments.
Structured | Datasets that conform to the formal structure of databases or spreadsheets (rows and columns). Data in this format can be used for rapid calculation and other forms of computation. Ex. a SQL database.

Structured data is ideal for nearly every data science application. Data in this format, however, can be difficult, costly, or time consuming to collect on your own.

For those starting out in data science, there are many free-to-use data sources available online.

Some of my favorites include the Census, the American Community Survey, stock prices, Zillow research, and Google Trends.

Data science will continue to evolve. As our analysis tools improve, our need for such structured data may decline. Many data scientists are already using tools like Natural Language Processing and Computer Vision to analyze data in unstructured formats.