# import the pandas package
import pandas as pd
# set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# load tips
tips = pd.read_csv(path + 'SampleData/tips.csv')Subsetting Pandas DataFrames II
In the last lesson, you learned how to subset dataframes by entire rows or entire columns. Now we’re going to learn how to do both at the same time!
Let’s read in the tips dataset again:
Take a look again at the beginning of the tips DataFrame:
# view the beginning of tips
tips.head()| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
First, let’s recall how you would subset three columns, total_bill, day, and time. Let’s save it to a variable called subset1.
# subset the colums and save it to subset1
columns = ['total_bill', 'day', 'time']
subset1 = tips[columns]
#subset1 = tips[['total_bill', 'day', 'time']] # alternative method
print(subset1) total_bill day time
0 16.99 Sun Dinner
1 10.34 Sun Dinner
2 21.01 Sun Dinner
3 23.68 Sun Dinner
4 24.59 Sun Dinner
.. ... ... ...
239 29.03 Sat Dinner
240 27.18 Sat Dinner
241 22.67 Sat Dinner
242 17.82 Sat Dinner
243 18.78 Thur Dinner
[244 rows x 3 columns]
Now, how would you subset the 6th row through the 10th row from the subset1 dataframe? Let’s save it to a variable called subset2.
# subset the 6th row through the 10th row and save it to tips_subset_rows
subset2 = subset1.iloc[5:10]
print(subset2) total_bill day time
5 25.29 Sun Dinner
6 8.77 Sun Dinner
7 26.88 Sun Dinner
8 15.04 Sun Dinner
9 14.78 Sun Dinner
Now the subset2 dataframe has just the rows 5 through 10 and three columns. We can even subset rows and columns in the same line of code, instead of doing it on multiple lines like we did above. Let’s try combining both iloc and square brackets [] on one line:
# subset rows & columns at the same time
subset3 = tips.iloc[5:10][['total_bill', 'day', 'time']]
print(subset3) total_bill day time
5 25.29 Sun Dinner
6 8.77 Sun Dinner
7 26.88 Sun Dinner
8 15.04 Sun Dinner
9 14.78 Sun Dinner
What do you notice about subset3? How does it compare to subset2?
# compare subset2 and subset3
subset2 == subset3
# every value printed out is True, so they're exactly the same| total_bill | day | time | |
|---|---|---|---|
| 5 | True | True | True |
| 6 | True | True | True |
| 7 | True | True | True |
| 8 | True | True | True |
| 9 | True | True | True |
Now you try! Subset rows 11 and 12 and columns total_bill and tip on one line of code:
# subset rows and columns
tips.iloc[10:12][['total_bill', 'tip']]| total_bill | tip | |
|---|---|---|
| 10 | 10.27 | 1.71 |
| 11 | 35.26 | 5.00 |
Sometimes we don’t know exactly which row(s) we want to subset ahead of time. What if we want to subset rows that have a certain value in the time column? We don’t want to scroll through hundreds of rows to find them. The good news is: we don’t have to! Let’s use the method called query. Inside the parentheses of query we’ll enclose a statement in quotes with the name of the column and an expression.
tips.query('time == "Lunch"')| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 77 | 27.20 | 4.00 | Male | No | Thur | Lunch | 4 |
| 78 | 22.76 | 3.00 | Male | No | Thur | Lunch | 2 |
| 79 | 17.29 | 2.71 | Male | No | Thur | Lunch | 2 |
| 80 | 19.44 | 3.00 | Male | Yes | Thur | Lunch | 2 |
| 81 | 16.66 | 3.40 | Male | No | Thur | Lunch | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 222 | 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
| 223 | 15.98 | 3.00 | Female | No | Fri | Lunch | 3 |
| 224 | 13.42 | 1.58 | Male | Yes | Fri | Lunch | 2 |
| 225 | 16.27 | 2.50 | Female | Yes | Fri | Lunch | 2 |
| 226 | 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
68 rows × 7 columns
The above cell showed us all the rows where time is equal to “Lunch”. We had to enclose “Lunch” in quotes above because it’s not the name of a column, but a value within the time column.
Now you try: subset the rows where the waitress is female and save it to a variable, female:
# subset rows with a female waitress and save it to a variable
female = tips.query('sex == "Female"')
# take a look at the beginning
female.head()| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
| 14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
| 16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
Now lets do the same for males. Subset the male waiter data and save it to a variable, male:
# subset the male waiters and save it
male = tips.query('sex == "Male"')
# look at the beginning
male.head()| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| 6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
How would you determine the number of dining tables with male waiters in this DataFrame? Think back to the last lesson when we used the len function.
# number of males
len(male)157
How about the number of dining tables with female waitresses?
# number of females
len(female)87
We can use query on multiple columns at a time. Let’s find out how many tables were served by a female waitress on a Sunday.
tips.query('sex == "Female" and day == "Sun"')| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
| 14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
| 16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
| 18 | 16.97 | 3.50 | Female | No | Sun | Dinner | 3 |
| 51 | 10.29 | 2.60 | Female | No | Sun | Dinner | 2 |
| 52 | 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
| 114 | 25.71 | 4.00 | Female | No | Sun | Dinner | 3 |
| 115 | 17.31 | 3.50 | Female | No | Sun | Dinner | 2 |
| 155 | 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
| 157 | 25.00 | 3.75 | Female | No | Sun | Dinner | 4 |
| 158 | 13.39 | 2.61 | Female | No | Sun | Dinner | 2 |
| 162 | 16.21 | 2.00 | Female | No | Sun | Dinner | 3 |
| 164 | 17.51 | 3.00 | Female | Yes | Sun | Dinner | 2 |
| 178 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 |
| 186 | 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 |
| 188 | 18.15 | 3.50 | Female | Yes | Sun | Dinner | 3 |
We used the keyword and to chain together two statements inside the query function. Both statements have to be true for a row to be included.
Besides checking whether values are equal using ==, we can also use greater than, less than, greater than or equal, etc. Try subsetting the rows where the bill is greater than \(\$15\) and the tip is less than \(\$2\):
# subset by bill and tip
tips.query('total_bill > 15 and tip < 2')| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
| 12 | 15.42 | 1.57 | Male | No | Sun | Dinner | 2 |
| 57 | 26.41 | 1.50 | Female | No | Sat | Dinner | 2 |
| 105 | 15.36 | 1.64 | Male | Yes | Sat | Dinner | 2 |
| 130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
| 146 | 18.64 | 1.36 | Female | No | Thur | Lunch | 3 |
| 190 | 15.69 | 1.50 | Male | Yes | Sun | Dinner | 2 |
| 237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
Instead of and we can use the keyword or to represent a query where one of the two conditions must be fulfilled. Try subsetting where the bill is greater than \(\$15\) or the tip is greater than \(\$5\):
# subset by bill or tip
tips.query('total_bill > 15 or tip > 5')| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
165 rows × 7 columns
In this lesson, you learned:
- How to use
ilocand square brackets[]at the same time. - How to use
queryto find rows where the column has a certain value.