# import the pandas package
import pandas as pd
# set the path
= 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
path # load tips
= pd.read_csv(path + 'SampleData/tips.csv') tips
Subsetting Pandas DataFrames II
In the last lesson, you learned how to subset dataframes by entire rows or entire columns. Now we’re going to learn how to do both at the same time!
Let’s read in the tips
dataset again:
Take a look again at the beginning of the tips
DataFrame
:
# view the beginning of tips
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
First, let’s recall how you would subset three columns, total_bill
, day
, and time
. Let’s save it to a variable called subset1
.
# subset the colums and save it to subset1
= ['total_bill', 'day', 'time']
columns = tips[columns]
subset1 #subset1 = tips[['total_bill', 'day', 'time']] # alternative method
print(subset1)
total_bill day time
0 16.99 Sun Dinner
1 10.34 Sun Dinner
2 21.01 Sun Dinner
3 23.68 Sun Dinner
4 24.59 Sun Dinner
.. ... ... ...
239 29.03 Sat Dinner
240 27.18 Sat Dinner
241 22.67 Sat Dinner
242 17.82 Sat Dinner
243 18.78 Thur Dinner
[244 rows x 3 columns]
Now, how would you subset the 6th row through the 10th row from the subset1
dataframe? Let’s save it to a variable called subset2
.
# subset the 6th row through the 10th row and save it to tips_subset_rows
= subset1.iloc[5:10]
subset2 print(subset2)
total_bill day time
5 25.29 Sun Dinner
6 8.77 Sun Dinner
7 26.88 Sun Dinner
8 15.04 Sun Dinner
9 14.78 Sun Dinner
Now the subset2
dataframe has just the rows 5 through 10 and three columns. We can even subset rows and columns in the same line of code, instead of doing it on multiple lines like we did above. Let’s try combining both iloc
and square brackets []
on one line:
# subset rows & columns at the same time
= tips.iloc[5:10][['total_bill', 'day', 'time']]
subset3 print(subset3)
total_bill day time
5 25.29 Sun Dinner
6 8.77 Sun Dinner
7 26.88 Sun Dinner
8 15.04 Sun Dinner
9 14.78 Sun Dinner
What do you notice about subset3
? How does it compare to subset2
?
# compare subset2 and subset3
== subset3
subset2
# every value printed out is True, so they're exactly the same
total_bill | day | time | |
---|---|---|---|
5 | True | True | True |
6 | True | True | True |
7 | True | True | True |
8 | True | True | True |
9 | True | True | True |
Now you try! Subset rows 11 and 12 and columns total_bill
and tip
on one line of code:
# subset rows and columns
10:12][['total_bill', 'tip']] tips.iloc[
total_bill | tip | |
---|---|---|
10 | 10.27 | 1.71 |
11 | 35.26 | 5.00 |
Sometimes we don’t know exactly which row(s) we want to subset ahead of time. What if we want to subset rows that have a certain value in the time
column? We don’t want to scroll through hundreds of rows to find them. The good news is: we don’t have to! Let’s use the method
called query
. Inside the parentheses of query
we’ll enclose a statement in quotes with the name of the column and an expression.
'time == "Lunch"') tips.query(
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
77 | 27.20 | 4.00 | Male | No | Thur | Lunch | 4 |
78 | 22.76 | 3.00 | Male | No | Thur | Lunch | 2 |
79 | 17.29 | 2.71 | Male | No | Thur | Lunch | 2 |
80 | 19.44 | 3.00 | Male | Yes | Thur | Lunch | 2 |
81 | 16.66 | 3.40 | Male | No | Thur | Lunch | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
222 | 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
223 | 15.98 | 3.00 | Female | No | Fri | Lunch | 3 |
224 | 13.42 | 1.58 | Male | Yes | Fri | Lunch | 2 |
225 | 16.27 | 2.50 | Female | Yes | Fri | Lunch | 2 |
226 | 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
68 rows × 7 columns
The above cell showed us all the rows where time
is equal to “Lunch”. We had to enclose “Lunch” in quotes above because it’s not the name of a column, but a value within the time
column.
Now you try: subset the rows where the waitress is female and save it to a variable, female
:
# subset rows with a female waitress and save it to a variable
= tips.query('sex == "Female"')
female
# take a look at the beginning
female.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
Now lets do the same for males. Subset the male waiter data and save it to a variable, male
:
# subset the male waiters and save it
= tips.query('sex == "Male"')
male
# look at the beginning
male.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
How would you determine the number of dining tables with male waiters in this DataFrame
? Think back to the last lesson when we used the len
function.
# number of males
len(male)
157
How about the number of dining tables with female waitresses?
# number of females
len(female)
87
We can use query
on multiple columns at a time. Let’s find out how many tables were served by a female waitress on a Sunday.
'sex == "Female" and day == "Sun"') tips.query(
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
18 | 16.97 | 3.50 | Female | No | Sun | Dinner | 3 |
51 | 10.29 | 2.60 | Female | No | Sun | Dinner | 2 |
52 | 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
114 | 25.71 | 4.00 | Female | No | Sun | Dinner | 3 |
115 | 17.31 | 3.50 | Female | No | Sun | Dinner | 2 |
155 | 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
157 | 25.00 | 3.75 | Female | No | Sun | Dinner | 4 |
158 | 13.39 | 2.61 | Female | No | Sun | Dinner | 2 |
162 | 16.21 | 2.00 | Female | No | Sun | Dinner | 3 |
164 | 17.51 | 3.00 | Female | Yes | Sun | Dinner | 2 |
178 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 |
186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 |
188 | 18.15 | 3.50 | Female | Yes | Sun | Dinner | 3 |
We used the keyword and
to chain together two statements inside the query
function. Both statements have to be true for a row to be included.
Besides checking whether values are equal using ==, we can also use greater than, less than, greater than or equal, etc. Try subsetting the rows where the bill is greater than \(\$15\) and the tip is less than \(\$2\):
# subset by bill and tip
'total_bill > 15 and tip < 2') tips.query(
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
12 | 15.42 | 1.57 | Male | No | Sun | Dinner | 2 |
57 | 26.41 | 1.50 | Female | No | Sat | Dinner | 2 |
105 | 15.36 | 1.64 | Male | Yes | Sat | Dinner | 2 |
130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
146 | 18.64 | 1.36 | Female | No | Thur | Lunch | 3 |
190 | 15.69 | 1.50 | Male | Yes | Sun | Dinner | 2 |
237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
Instead of and
we can use the keyword or
to represent a query where one of the two conditions must be fulfilled. Try subsetting where the bill is greater than \(\$15\) or the tip is greater than \(\$5\):
# subset by bill or tip
'total_bill > 15 or tip > 5') tips.query(
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
165 rows × 7 columns
In this lesson, you learned:
- How to use
iloc
and square brackets[]
at the same time. - How to use
query
to find rows where the column has a certain value.