Subsetting Pandas DataFrames I

You now know how to read external datasets into pandas. Let’s put those skills to use and read in the tips dataset again:

# import the pandas package
import pandas as pd
# set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# load tips
tips = pd.read_csv(path + 'SampleData/tips.csv')

Take a look again at the beginning of the tips DataFrame:

# view the beginning of tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

What if we decided we didn’t want to keep all of the data recorded in this dataset? To do that, we need to learn how to subset DataFrames. Subsetting means taking a dataset and pulling out a small portion of it that we’re interested in.

First, we’ll look at a single column (you can use head to keep the printed result short):

# subset one column
tips['day'].head(10)
0    Sun
1    Sun
2    Sun
3    Sun
4    Sun
5    Sun
6    Sun
7    Sun
8    Sun
9    Sun
Name: day, dtype: object

We use the square brackets [ ] after the name of the DataFrame to tell pandas that we want to look at one of the columns. We put the name of the column in quotes to tell pandas exactly which column we want to look at. Try subsetting the total_bill column:

# subset the total_bill column
tips['total_bill'].head(10)
0    16.99
1    10.34
2    21.01
3    23.68
4    24.59
5    25.29
6     8.77
7    26.88
8    15.04
9    14.78
Name: total_bill, dtype: float64

pandas simply showed us the result of subsetting the column, but it didn’t save the result anywhere. Try saving the total_bill column to a new variable, bills:

# save the total_bill column to a variable
bills = tips['total_bill']

We can also pull out multiple columns at a time to create a new DataFrame. If we were only interested in the total_bill and tip, we can subset them like this:

# subset the columns total_bill and tip
tips[['total_bill', 'tip']].head(10)
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
3 23.68 3.31
4 24.59 3.61
5 25.29 4.71
6 8.77 2.00
7 26.88 3.12
8 15.04 1.96
9 14.78 3.23

Does that look familiar? Instead of putting a single string between the square brackets, we put a whole list of strings – you can tell it’s a list by the second set of square brackets. You can also create the list of columns you’re interested in and subset the dataframe in two separate steps. This code works exactly the same as what we just did above.

columns = ['total_bill', 'tip']
tips[columns].head(10)
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
3 23.68 3.31
4 24.59 3.61
5 25.29 4.71
6 8.77 2.00
7 26.88 3.12
8 15.04 1.96
9 14.78 3.23

Now you try: subset the columns total_bill, tip, and time and save the result to a variable called tips_subset:

# subset three columns and save to a new variable
tips_subset = tips[['total_bill', 'tip', 'time']]

# take a look at the beginning of the new DataFrame
tips_subset.head()
total_bill tip time
0 16.99 1.01 Dinner
1 10.34 1.66 Dinner
2 21.01 3.50 Dinner
3 23.68 3.31 Dinner
4 24.59 3.61 Dinner

Now we’ve learned how to subset columns. How do we subset rows? We use a method of DataFrame called iloc. When you see iloc, think “index location” – because we want to get the location where the row is a certain index. Let’s try it:

# subset a row
tips.iloc[1]
total_bill     10.34
tip             1.66
sex             Male
smoker            No
day              Sun
time          Dinner
size               3
Name: 1, dtype: object

That showed us the row with an index of 1. Similarly to subsetting columns, we can also subset multiple rows:

# subset the first three rows
tips.iloc[[0,1,2]]
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3

That gave us a smaller DataFrame where the rows have an index of 0, 1, or 2. We can do the same thing with slicing syntax:

# subset multiple rows
tips.iloc[0:3]
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3

Notice that this does the same thing as calling head with a value of 3:

# use head
tips.head(3)
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3

What if we want to grab some rows in the middle of the DataFrame? Try subsetting the 100th through 105th row. Hint: Don’t forget that counting starts at 0 in Python!

# subset the 100th row through the 105th row
tips.iloc[99:105]
total_bill tip sex smoker day time size
99 12.46 1.50 Male No Fri Dinner 2
100 11.35 2.50 Female Yes Fri Dinner 2
101 15.38 3.00 Female Yes Fri Dinner 2
102 44.30 2.50 Female Yes Sat Dinner 3
103 22.42 3.48 Female Yes Sat Dinner 2
104 20.92 4.08 Female No Sat Dinner 2

Congrats on making it to the end of this lesson – we learned a lot!