# import the pandas package
import pandas as pd
# set the path
= 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
path # load tips
= pd.read_csv(path + 'SampleData/tips.csv') tips
Subsetting Pandas DataFrames I
You now know how to read external datasets into pandas
. Let’s put those skills to use and read in the tips
dataset again:
Take a look again at the beginning of the tips
DataFrame
:
# view the beginning of tips
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
What if we decided we didn’t want to keep all of the data recorded in this dataset? To do that, we need to learn how to subset
DataFrames
. Subsetting means taking a dataset and pulling out a small portion of it that we’re interested in.
First, we’ll look at a single column (you can use head
to keep the printed result short):
# subset one column
'day'].head(10) tips[
0 Sun
1 Sun
2 Sun
3 Sun
4 Sun
5 Sun
6 Sun
7 Sun
8 Sun
9 Sun
Name: day, dtype: object
We use the square brackets [ ] after the name of the DataFrame
to tell pandas
that we want to look at one of the columns. We put the name of the column in quotes to tell pandas
exactly which column we want to look at. Try subsetting the total_bill
column:
# subset the total_bill column
'total_bill'].head(10) tips[
0 16.99
1 10.34
2 21.01
3 23.68
4 24.59
5 25.29
6 8.77
7 26.88
8 15.04
9 14.78
Name: total_bill, dtype: float64
pandas
simply showed us the result of subsetting the column, but it didn’t save the result anywhere. Try saving the total_bill
column to a new variable, bills
:
# save the total_bill column to a variable
= tips['total_bill'] bills
We can also pull out multiple columns at a time to create a new DataFrame
. If we were only interested in the total_bill
and tip
, we can subset them like this:
# subset the columns total_bill and tip
'total_bill', 'tip']].head(10) tips[[
total_bill | tip | |
---|---|---|
0 | 16.99 | 1.01 |
1 | 10.34 | 1.66 |
2 | 21.01 | 3.50 |
3 | 23.68 | 3.31 |
4 | 24.59 | 3.61 |
5 | 25.29 | 4.71 |
6 | 8.77 | 2.00 |
7 | 26.88 | 3.12 |
8 | 15.04 | 1.96 |
9 | 14.78 | 3.23 |
Does that look familiar? Instead of putting a single string between the square brackets, we put a whole list of strings – you can tell it’s a list by the second set of square brackets. You can also create the list of columns you’re interested in and subset the dataframe in two separate steps. This code works exactly the same as what we just did above.
= ['total_bill', 'tip']
columns 10) tips[columns].head(
total_bill | tip | |
---|---|---|
0 | 16.99 | 1.01 |
1 | 10.34 | 1.66 |
2 | 21.01 | 3.50 |
3 | 23.68 | 3.31 |
4 | 24.59 | 3.61 |
5 | 25.29 | 4.71 |
6 | 8.77 | 2.00 |
7 | 26.88 | 3.12 |
8 | 15.04 | 1.96 |
9 | 14.78 | 3.23 |
Now you try: subset the columns total_bill
, tip
, and time
and save the result to a variable called tips_subset
:
# subset three columns and save to a new variable
= tips[['total_bill', 'tip', 'time']]
tips_subset
# take a look at the beginning of the new DataFrame
tips_subset.head()
total_bill | tip | time | |
---|---|---|---|
0 | 16.99 | 1.01 | Dinner |
1 | 10.34 | 1.66 | Dinner |
2 | 21.01 | 3.50 | Dinner |
3 | 23.68 | 3.31 | Dinner |
4 | 24.59 | 3.61 | Dinner |
Now we’ve learned how to subset columns. How do we subset rows? We use a method
of DataFrame
called iloc
. When you see iloc
, think “index location” – because we want to get the location where the row is a certain index. Let’s try it:
# subset a row
1] tips.iloc[
total_bill 10.34
tip 1.66
sex Male
smoker No
day Sun
time Dinner
size 3
Name: 1, dtype: object
That showed us the row with an index of 1. Similarly to subsetting columns, we can also subset multiple rows:
# subset the first three rows
0,1,2]] tips.iloc[[
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
That gave us a smaller DataFrame
where the rows have an index of 0, 1, or 2. We can do the same thing with slicing syntax:
# subset multiple rows
0:3] tips.iloc[
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
Notice that this does the same thing as calling head
with a value of 3:
# use head
3) tips.head(
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
What if we want to grab some rows in the middle of the DataFrame
? Try subsetting the 100th through 105th row. Hint: Don’t forget that counting starts at 0
in Python!
# subset the 100th row through the 105th row
99:105] tips.iloc[
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
99 | 12.46 | 1.50 | Male | No | Fri | Dinner | 2 |
100 | 11.35 | 2.50 | Female | Yes | Fri | Dinner | 2 |
101 | 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 |
102 | 44.30 | 2.50 | Female | Yes | Sat | Dinner | 3 |
103 | 22.42 | 3.48 | Female | Yes | Sat | Dinner | 2 |
104 | 20.92 | 4.08 | Female | No | Sat | Dinner | 2 |
Congrats on making it to the end of this lesson – we learned a lot!
- How to use square brackets
[]
to subset columns. - How to use
iloc
to subset rows.