Python for Data Science: The 5 Things You Need to Get Started

Python is a general-purpose programming language that was originally released in the early 1990s. Over the years, it has become known for being easy to read and learn.

Like many modern programming languages, Python is open source, which means it can be downloaded and used for free.

While Python is useful on its own, developers have created many packages that can be added to Python to extend its functionality even further.

It is also worth noting that Python has been released in three major versions. While Python 2 still has many active users, Python 3 is the future of Python.

If you are just getting started with Python, it is best to start with Python 3 so you are learning to work with the latest and greatest Python packages.
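If you are unsure which version is installed on your machine, you can check from a terminal (the exact command name varies by operating system):

```shell
# Print the version of the default Python interpreter
python --version

# On many systems, Python 3 is installed under the name python3
python3 --version
```

If the output starts with "Python 3", you are ready to go.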

1: Anaconda

The Anaconda distribution of Python by Continuum Analytics is the first tool you need to get started with data science in Python.

Anaconda comes with many of the most popular Python packages, in addition to an interactive coding environment called Jupyter Notebook. (All for free!)

Getting started with Anaconda is straightforward. Navigate to the Anaconda website to download the installer. Once installed, you will be greeted by Anaconda's main screen.

Anaconda Navigator’s main screen.

From here you can launch applications and manage your Python packages.

2: Jupyter Notebooks

Being comfortable with Jupyter Notebooks is key for any aspiring data scientist.

Jupyter is famous for allowing Python developers to code in an interactive computing environment. Simply put, you can execute code as you write it.

Code in Jupyter notebooks is executed in cells. Open your own Jupyter notebook and type the following code in the first cell:

print("Jupyter is Great")

Once you have written your code, press Shift + Enter. The output will appear below the cell.

Executing code from the first cell.

In the second cell, enter new code and press Shift + Enter.

Code in the second cell executed after the first cell.

Code in the second cell was executed separately from the first cell.
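Cells also share state: a variable defined in one cell remains available in every cell you run afterward. A minimal sketch of two cells:

```python
# Cell 1: define a variable
message = "Jupyter is Great"

# Cell 2: the variable from the first cell is still available
print(message.upper())  # prints JUPYTER IS GREAT
```

This shared state is what makes notebooks feel interactive: you can build up your analysis one cell at a time.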

Jupyter is a powerful tool for data science. As you begin to use it more, its benefits will become even more apparent. 

3: Pandas

Pandas is a free Python package that allows developers to import, manipulate, and visualize data in tables called DataFrames. Often, work typically done in spreadsheets can be completed much faster in Pandas.

If you installed Anaconda, launch a Jupyter notebook to get started with Pandas. To use any Python package, you need to import the package.

In the first cell of your Jupyter notebook, type the following code and press Shift + Enter.

import pandas as pd

Now you are ready to import data into Pandas. I added a .csv file to the same folder that my Jupyter notebook is stored in. Execute the following code to import the data and save it as a variable titled df.

df = pd.read_csv('stocks.csv')

Once your data is imported, you can execute the following command to view the first five rows of your data.

df.head()
Viewing the head of my data.

This just scrapes the surface of what Pandas is capable of.

4: Matplotlib

Matplotlib, like Pandas, is another Python package that is free to use. Matplotlib is used for visualizing data – or in other words, making graphs!

Visualizing your findings is a crucial part of data science. If you are unable to communicate your findings to others, then your effectiveness as a data scientist is limited.

Matplotlib is already installed with Anaconda. You can import it with the following code:

import matplotlib.pyplot as plt

I am plotting my DataFrame (stored in the variable df) with this code:

df.plot(kind='bar', x='date', y='price', color='green', alpha=0.25,
        ylim=(100, 110), figsize=(12, 8))

plt.show()

The result should look something like the following:

The bar graph of stock prices.

Matplotlib can customize graphs far beyond this example, which covered basic plotting, colors, transparency (alpha), axis limits, and figure sizing.
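A sketch of a similar plot built with Matplotlib's object-oriented interface, which makes titles and axis labels explicit (the price values here are made up, and the Agg backend is set so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render to a file instead of opening a window
import matplotlib.pyplot as plt

# Made-up prices for illustration
dates = ["2020-01-01", "2020-01-02", "2020-01-03"]
prices = [104.5, 106.2, 103.8]

fig, ax = plt.subplots(figsize=(12, 8))
ax.bar(dates, prices, color="green", alpha=0.25)
ax.set_ylim(100, 110)
ax.set_title("Stock Prices")
ax.set_xlabel("date")
ax.set_ylabel("price")
fig.savefig("stocks.png")  # write the figure to a PNG file
```

The `fig, ax` pattern gives you finer control than `df.plot()` once your graphs grow more complex.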

5: Data

Data analysis is only as good as the data being used in the analysis. It is important that the data you use in your own work is structured and accessible.

Data generally falls into three levels of structure:

Unstructured: Datasets that do not conform to any unified format. Examples: audio files, text, pictures.

Semi-structured: Datasets that do not conform to the formal structure of a database or spreadsheet (rows and columns), but are organized by tags or markers. Example: a Word document with comments.

Structured: Datasets that conform to the formal structure of databases or spreadsheets (rows and columns). Data in this format can be used for rapid calculation and other forms of computation. Example: a SQL database.
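As a quick illustration, Pandas can turn semi-structured records (a list of key-tagged dictionaries) into a structured rows-and-columns table. A minimal sketch:

```python
import pandas as pd

# Semi-structured: records organized by keys rather than rows and columns
records = [
    {"date": "2020-01-01", "price": 104.5},
    {"date": "2020-01-02", "price": 106.2},
]

# Structured: the same data as a DataFrame with rows and columns
df = pd.DataFrame(records)
print(df.shape)  # (2, 2) - two rows, two columns
```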

Structured data is ideal for nearly every data science application. Data in this format, however, can be difficult, costly, or time-consuming to collect on your own.

For those starting out in data science, there are many free-to-use data sources available online.

Some of my favorites include the Census, the American Community Survey, stock prices, Zillow research, and Google Trends.

Data science will continue to evolve. As our analysis tools improve, our need for such structured data may decline. Many data scientists are already using tools like natural language processing and computer vision to analyze data in unstructured formats.