# import the pandas package
import pandas as pd
Reading Data with Pandas
In the last lesson, we learned how pandas
stores data as rows and columns in DataFrames
. We previously used a small dataset that was hard-coded right in the notebook. But in the real world, we want to be able to use large datasets that can’t be easily hard-coded or typed out by hand. One way that we can store large datasets as files is in the CSV format. This is a format which can be opened by many different programs like Excel, Google Sheets, or our Python programs, which allows us to share data easily.
Let’s start by importing pandas. We can use the pd
nickname like before:
Now we’re ready to read our dataset into Python with pandas
! We’ll use a function called read_csv
. Our dataset is in our GWC GitHub repository, and we need to tell read_csv
exactly where to find it. read_csv
will create a DataFrame
for us. Let’s call it tips
:
# load the tips csv
= 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
path = pd.read_csv(path + 'SampleData/tips.csv') tips
Since we saved the data to a variable, pandas
didn’t show us what it looks like. How would you view the beginning of the tips
DataFrame
without seeing every row? Try it below:
# View just the beginning of the tips DataFrame
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Look at the column names of the tips
DataFrame
. We have total_bill
, tip
, sex
, smoker
, day
, time
, and size
. Based on the column names and some of the values in the DataFrame
, what do you think the rows each represent?
The rows represent: dining parties in a restaurant
Now let’s take a look at the end of the DataFrame
:
# View the end of the tips DataFrame
tips.tail()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
Notice the numbers on the far left side of the DataFrame
. pandas
assigned a number to every row. What number did pandas
assign to the very first row of the DataFrame
? (Scroll up if you need to.) So how many rows do we have in this DataFrame
?
Number of rows: 244
The column of numbers that label the rows is called the index
of the DataFrame
. The index
is an attribute
, a special variable which belongs to variables of the DataFrame
type. An example of an attribute would be if you had a variable dog
with an attribute dog.owner
to store the name of the person who owns the dog.
We can view the DataFrame
’s index
like this:
# view the index
tips.index
RangeIndex(start=0, stop=244, step=1)
So our index starts at 0, ends at 244, and increases by 1 for each row. Another way to count the number of rows is to take the length of the index using the len
function:
# get the length of the index
len(tips.index)
244
Like the index
labels the rows of the DataFrame
, there is an attribute
called columns
that refers to the columns of the DataFrame
. Let’s take a look:
# view the columns
tips.columns
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
We could count the number of columns – there aren’t too many – but what’s the fun in that? Let’s write a line of code to tell us the number of columns:
# length of the DataFrame's columns
len(tips.columns)
7
Conveniently, we can also call len
on the DataFrame
itself. Try it here! Is the result equal to the number of rows or the number of columns?
# use len on tips
len(tips)
244
Based on the number of rows and columns, how many data points are in the tips
DataFrame
?
# calculate the number of data points in tips
7 * 244
1708
That’s a lot more data than we’ve handled before. But that’s nothing for pandas
– it can handle DataFrames
with millions of rows! Data scientists use pandas
to handle very large datasets from the real world.
Instead of typing the number of rows and columns in the DataFrame, we could put both commands with len
on the same line. Try it here:
# Multiply the length of rows & columns without typing numbers
len(tips) * len(tips.columns)
This way, if the tips
data changes, we can quickly re-run the above cell to find the number of values in it, without having to manually type out the number of rows and columns.
You just learned:
- How to read datasets into
pandas
DataFrames
. - The
index
andcolumns
attributes ofDataFrames
. - How to find the number of rows, columns, and number of data points in a
DataFrame
.