Subsetting Pandas DataFrames II

In the last lesson, you learned how to subset dataframes by entire rows or entire columns. Now we’re going to learn how to do both at the same time!

Let’s read in the tips dataset again:

# import the pandas package
import pandas as pd
# set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# load tips
tips = pd.read_csv(path + 'SampleData/tips.csv')

Take a look again at the beginning of the tips DataFrame:

# view the beginning of tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

First, let’s recall how you would subset three columns, total_bill, day, and time. Let’s save it to a variable called subset1.

# subset the colums and save it to subset1
columns = ['total_bill', 'day', 'time']
subset1 = tips[columns]
#subset1 = tips[['total_bill', 'day', 'time']] # alternative method
print(subset1)
     total_bill   day    time
0         16.99   Sun  Dinner
1         10.34   Sun  Dinner
2         21.01   Sun  Dinner
3         23.68   Sun  Dinner
4         24.59   Sun  Dinner
..          ...   ...     ...
239       29.03   Sat  Dinner
240       27.18   Sat  Dinner
241       22.67   Sat  Dinner
242       17.82   Sat  Dinner
243       18.78  Thur  Dinner

[244 rows x 3 columns]

Now, how would you subset the 6th row through the 10th row from the subset1 dataframe? Let’s save it to a variable called subset2.

# subset the 6th row through the 10th row and save it to tips_subset_rows
subset2 = subset1.iloc[5:10]
print(subset2)
   total_bill  day    time
5       25.29  Sun  Dinner
6        8.77  Sun  Dinner
7       26.88  Sun  Dinner
8       15.04  Sun  Dinner
9       14.78  Sun  Dinner

Now the subset2 dataframe has just the rows 5 through 10 and three columns. We can even subset rows and columns in the same line of code, instead of doing it on multiple lines like we did above. Let’s try combining both iloc and square brackets [] on one line:

# subset rows & columns at the same time
subset3 = tips.iloc[5:10][['total_bill', 'day', 'time']]
print(subset3)
   total_bill  day    time
5       25.29  Sun  Dinner
6        8.77  Sun  Dinner
7       26.88  Sun  Dinner
8       15.04  Sun  Dinner
9       14.78  Sun  Dinner

What do you notice about subset3? How does it compare to subset2?

# compare subset2 and subset3
subset2 == subset3

# every value printed out is True, so they're exactly the same
total_bill day time
5 True True True
6 True True True
7 True True True
8 True True True
9 True True True

Now you try! Subset rows 11 and 12 and columns total_bill and tip on one line of code:

# subset rows and columns
tips.iloc[10:12][['total_bill', 'tip']]
total_bill tip
10 10.27 1.71
11 35.26 5.00

Sometimes we don’t know exactly which row(s) we want to subset ahead of time. What if we want to subset rows that have a certain value in the time column? We don’t want to scroll through hundreds of rows to find them. The good news is: we don’t have to! Let’s use the method called query. Inside the parentheses of query we’ll enclose a statement in quotes with the name of the column and an expression.

tips.query('time == "Lunch"')
total_bill tip sex smoker day time size
77 27.20 4.00 Male No Thur Lunch 4
78 22.76 3.00 Male No Thur Lunch 2
79 17.29 2.71 Male No Thur Lunch 2
80 19.44 3.00 Male Yes Thur Lunch 2
81 16.66 3.40 Male No Thur Lunch 2
... ... ... ... ... ... ... ...
222 8.58 1.92 Male Yes Fri Lunch 1
223 15.98 3.00 Female No Fri Lunch 3
224 13.42 1.58 Male Yes Fri Lunch 2
225 16.27 2.50 Female Yes Fri Lunch 2
226 10.09 2.00 Female Yes Fri Lunch 2

68 rows × 7 columns

The above cell showed us all the rows where time is equal to “Lunch”. We had to enclose “Lunch” in quotes above because it’s not the name of a column, but a value within the time column.

Now you try: subset the rows where the waitress is female and save it to a variable, female:

# subset rows with a female waitress and save it to a variable
female = tips.query('sex == "Female"')

# take a look at the beginning
female.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3

Now lets do the same for males. Subset the male waiter data and save it to a variable, male:

# subset the male waiters and save it
male = tips.query('sex == "Male"')

# look at the beginning
male.head()
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2

How would you determine the number of dining tables with male waiters in this DataFrame? Think back to the last lesson when we used the len function.

# number of males
len(male)
157

How about the number of dining tables with female waitresses?

# number of females
len(female)
87

We can use query on multiple columns at a time. Let’s find out how many tables were served by a female waitress on a Sunday.

tips.query('sex == "Female" and day == "Sun"')
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
11 35.26 5.00 Female No Sun Dinner 4
14 14.83 3.02 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
18 16.97 3.50 Female No Sun Dinner 3
51 10.29 2.60 Female No Sun Dinner 2
52 34.81 5.20 Female No Sun Dinner 4
114 25.71 4.00 Female No Sun Dinner 3
115 17.31 3.50 Female No Sun Dinner 2
155 29.85 5.14 Female No Sun Dinner 5
157 25.00 3.75 Female No Sun Dinner 4
158 13.39 2.61 Female No Sun Dinner 2
162 16.21 2.00 Female No Sun Dinner 3
164 17.51 3.00 Female Yes Sun Dinner 2
178 9.60 4.00 Female Yes Sun Dinner 2
186 20.90 3.50 Female Yes Sun Dinner 3
188 18.15 3.50 Female Yes Sun Dinner 3

We used the keyword and to chain together two statements inside the query function. Both statements have to be true for a row to be included.

Besides checking whether values are equal using ==, we can also use greater than, less than, greater than or equal, etc. Try subsetting the rows where the bill is greater than \(\$15\) and the tip is less than \(\$2\):

# subset by bill and tip
tips.query('total_bill > 15 and tip < 2')
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
8 15.04 1.96 Male No Sun Dinner 2
12 15.42 1.57 Male No Sun Dinner 2
57 26.41 1.50 Female No Sat Dinner 2
105 15.36 1.64 Male Yes Sat Dinner 2
130 19.08 1.50 Male No Thur Lunch 2
146 18.64 1.36 Female No Thur Lunch 3
190 15.69 1.50 Male Yes Sun Dinner 2
237 32.83 1.17 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2

Instead of and we can use the keyword or to represent a query where one of the two conditions must be fulfilled. Try subsetting where the bill is greater than \(\$15\) or the tip is greater than \(\$5\):

# subset by bill or tip
tips.query('total_bill > 15 or tip > 5')
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

165 rows × 7 columns

In this lesson, you learned: