Subsetting Pandas DataFrames II

In the last lesson, you learned how to subset dataframes by entire rows or entire columns. Now we’re going to learn how to do both at the same time!

Let’s read in the tips dataset again:

# import the pandas package
import pandas as pd
# set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# load tips
tips = pd.read_csv(path + 'SampleData/tips.csv')

Take a look again at the beginning of the tips DataFrame:

# view the beginning of tips
tips.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

First, let’s recall how you would subset three columns, total_bill, day, and time. Let’s save it to a variable called subset1.

# subset the colums and save it to subset1
columns = ['total_bill', 'day', 'time']
subset1 = tips[columns]
#subset1 = tips[['total_bill', 'day', 'time']] # alternative method
print(subset1)

     total_bill   day    time
0         16.99   Sun  Dinner
1         10.34   Sun  Dinner
2         21.01   Sun  Dinner
3         23.68   Sun  Dinner
4         24.59   Sun  Dinner
..          ...   ...     ...
239       29.03   Sat  Dinner
240       27.18   Sat  Dinner
241       22.67   Sat  Dinner
242       17.82   Sat  Dinner
243       18.78  Thur  Dinner

[244 rows x 3 columns]

Now, how would you subset the 6th row through the 10th row from the subset1 dataframe? Let’s save it to a variable called subset2.

# subset the 6th row through the 10th row and save it to tips_subset_rows
subset2 = subset1.iloc[5:10]
print(subset2)

   total_bill  day    time
5       25.29  Sun  Dinner
6        8.77  Sun  Dinner
7       26.88  Sun  Dinner
8       15.04  Sun  Dinner
9       14.78  Sun  Dinner

Now the subset2 dataframe has just the rows 5 through 10 and three columns. We can even subset rows and columns in the same line of code, instead of doing it on multiple lines like we did above. Let’s try combining both iloc and square brackets [] on one line:

# subset rows & columns at the same time
subset3 = tips.iloc[5:10][['total_bill', 'day', 'time']]
print(subset3)

   total_bill  day    time
5       25.29  Sun  Dinner
6        8.77  Sun  Dinner
7       26.88  Sun  Dinner
8       15.04  Sun  Dinner
9       14.78  Sun  Dinner

What do you notice about subset3? How does it compare to subset2?

# compare subset2 and subset3
subset2 == subset3

# every value printed out is True, so they're exactly the same

	total_bill	day	time
5	True	True	True
6	True	True	True
7	True	True	True
8	True	True	True
9	True	True	True

Now you try! Subset rows 11 and 12 and columns total_bill and tip on one line of code:

# subset rows and columns
tips.iloc[10:12][['total_bill', 'tip']]

	total_bill	tip
10	10.27	1.71
11	35.26	5.00

Sometimes we don’t know exactly which row(s) we want to subset ahead of time. What if we want to subset rows that have a certain value in the time column? We don’t want to scroll through hundreds of rows to find them. The good news is: we don’t have to! Let’s use the method called query. Inside the parentheses of query we’ll enclose a statement in quotes with the name of the column and an expression.

tips.query('time == "Lunch"')

	total_bill	tip	sex	smoker	day	time	size
77	27.20	4.00	Male	No	Thur	Lunch	4
78	22.76	3.00	Male	No	Thur	Lunch	2
79	17.29	2.71	Male	No	Thur	Lunch	2
80	19.44	3.00	Male	Yes	Thur	Lunch	2
81	16.66	3.40	Male	No	Thur	Lunch	2
...	...	...	...	...	...	...	...
222	8.58	1.92	Male	Yes	Fri	Lunch	1
223	15.98	3.00	Female	No	Fri	Lunch	3
224	13.42	1.58	Male	Yes	Fri	Lunch	2
225	16.27	2.50	Female	Yes	Fri	Lunch	2
226	10.09	2.00	Female	Yes	Fri	Lunch	2

68 rows × 7 columns

The above cell showed us all the rows where time is equal to “Lunch”. We had to enclose “Lunch” in quotes above because it’s not the name of a column, but a value within the time column.

Now you try: subset the rows where the waitress is female and save it to a variable, female:

# subset rows with a female waitress and save it to a variable
female = tips.query('sex == "Female"')

# take a look at the beginning
female.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3

Now lets do the same for males. Subset the male waiter data and save it to a variable, male:

# subset the male waiters and save it
male = tips.query('sex == "Male"')

# look at the beginning
male.head()

	total_bill	tip	sex	smoker	day	time	size
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
5	25.29	4.71	Male	No	Sun	Dinner	4
6	8.77	2.00	Male	No	Sun	Dinner	2

How would you determine the number of dining tables with male waiters in this DataFrame? Think back to the last lesson when we used the len function.

# number of males
len(male)

How about the number of dining tables with female waitresses?

# number of females
len(female)

We can use query on multiple columns at a time. Let’s find out how many tables were served by a female waitress on a Sunday.

tips.query('sex == "Female" and day == "Sun"')

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
11	35.26	5.00	Female	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3
18	16.97	3.50	Female	No	Sun	Dinner	3
51	10.29	2.60	Female	No	Sun	Dinner	2
52	34.81	5.20	Female	No	Sun	Dinner	4
114	25.71	4.00	Female	No	Sun	Dinner	3
115	17.31	3.50	Female	No	Sun	Dinner	2
155	29.85	5.14	Female	No	Sun	Dinner	5
157	25.00	3.75	Female	No	Sun	Dinner	4
158	13.39	2.61	Female	No	Sun	Dinner	2
162	16.21	2.00	Female	No	Sun	Dinner	3
164	17.51	3.00	Female	Yes	Sun	Dinner	2
178	9.60	4.00	Female	Yes	Sun	Dinner	2
186	20.90	3.50	Female	Yes	Sun	Dinner	3
188	18.15	3.50	Female	Yes	Sun	Dinner	3

We used the keyword and to chain together two statements inside the query function. Both statements have to be true for a row to be included.

Besides checking whether values are equal using ==, we can also use greater than, less than, greater than or equal, etc. Try subsetting the rows where the bill is greater than $\$15$ and the tip is less than $\$2$:

# subset by bill and tip
tips.query('total_bill > 15 and tip < 2')

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
8	15.04	1.96	Male	No	Sun	Dinner	2
12	15.42	1.57	Male	No	Sun	Dinner	2
57	26.41	1.50	Female	No	Sat	Dinner	2
105	15.36	1.64	Male	Yes	Sat	Dinner	2
130	19.08	1.50	Male	No	Thur	Lunch	2
146	18.64	1.36	Female	No	Thur	Lunch	3
190	15.69	1.50	Male	Yes	Sun	Dinner	2
237	32.83	1.17	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2

Instead of and we can use the keyword or to represent a query where one of the two conditions must be fulfilled. Try subsetting where the bill is greater than $\$15$ or the tip is greater than $\$5$:

# subset by bill or tip
tips.query('total_bill > 15 or tip > 5')

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
5	25.29	4.71	Male	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

165 rows × 7 columns

In this lesson, you learned:

How to use iloc and square brackets [] at the same time.
How to use query to find rows where the column has a certain value.