Basic Statistics I: Percents

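A percentage is a number or ratio expressed as a fraction of 100. To find what percentage a part is of a whole, divide the part by the whole and multiply by 100. We’ll work through some examples together to learn how to calculate percentages.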

Example 1: A basket of 18 fruits contains 5 apples, 3 bananas, 6 peaches, and 4 oranges.

What percentage of fruits are apples?

# Calculate percentage for apples
5/18*100

What percentage of fruits are either oranges or peaches?

# Calculate percentage for oranges and peaches
(4+6)/18*100
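The same arithmetic can be wrapped in a small helper so that every fruit's share is computed at once. This is only a sketch: the names percent, fruit_counts, and fruit_percents are made up for illustration and are not used elsewhere in this lesson.

```python
# Fruit counts from Example 1
fruit_counts = {"apples": 5, "bananas": 3, "peaches": 6, "oranges": 4}

# The whole is the total number of fruits in the basket
total_fruits = sum(fruit_counts.values())


def percent(part, whole):
    """Return part as a percentage of whole (hypothetical helper)."""
    return part / whole * 100


# Percentage of the basket made up by each kind of fruit
fruit_percents = {name: percent(count, total_fruits)
                  for name, count in fruit_counts.items()}

for name, pct in fruit_percents.items():
    print(f"{name}: {pct:.1f}%")
```

Because every fruit belongs to exactly one category, the four percentages add up to 100.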

Example 2: Let’s learn to calculate percentages using real-world data. We will work with a dataset of housing prices in Ames, Iowa.

# Import the fetch_openml function, plus pandas so that we can
# work with the data frame version of the Ames housing data
from sklearn.datasets import fetch_openml
import pandas as pd

# Load the dataset of house prices in Ames, and convert to
# a data frame format so it's easier to view and process
housing = fetch_openml(name="house_prices", as_frame=True, parser="auto")
ames_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
ames_df['SalePrice'] = housing.target
ames_df

The SaleCondition column records the condition of each house sale, such as Normal for a normal sale.

What percentage of the houses were sold normally? We’ll see how to do this in two ways: with the query method and with boolean indexing.

# Determine the number of houses sold under normal conditions two ways:
# (1) with the query method
num_normal = len(ames_df.query("SaleCondition == 'Normal'"))
num_normal
# (2) using boolean indexing
num_normal = sum(ames_df["SaleCondition"] == "Normal")
num_normal

Why do these two methods give the same answer?
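One way to see why the two counts agree is to try both on a tiny data frame where the answer is easy to check by eye. The sketch below uses a made-up toy_df with four rows (hypothetical data, not taken from the Ames dataset).

```python
import pandas as pd

# Hypothetical toy data, small enough to count by hand
toy_df = pd.DataFrame(
    {"SaleCondition": ["Normal", "Abnorml", "Normal", "Partial"]}
)

# Boolean indexing: comparing the column to a value gives one
# True/False entry per row
is_normal = toy_df["SaleCondition"] == "Normal"

# Summing booleans counts the True entries, because True acts like 1
# and False acts like 0
count_by_sum = sum(is_normal)

# query keeps only the rows where the condition holds, so len
# counts those same rows
count_by_query = len(toy_df.query("SaleCondition == 'Normal'"))

print(count_by_sum, count_by_query)
```

Both approaches evaluate the same condition on every row. One counts the True values, and the other counts the rows that survive the filter.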

# Determine the total number of houses in the dataset
total_num = len(ames_df)

# Now calculate the percentage of houses sold normally.
num_normal/total_num*100
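As an alternative to counting and dividing by hand, pandas can compute the fraction directly. The sketch below uses a made-up toy_df (hypothetical data) and shows two options: taking the mean of a boolean Series, and calling value_counts with normalize=True to get the share of every category at once.

```python
import pandas as pd

# Hypothetical toy data standing in for the SaleCondition column
toy_df = pd.DataFrame(
    {"SaleCondition": ["Normal", "Abnorml", "Normal", "Partial", "Normal"]}
)

# The mean of a boolean Series is the fraction of True entries,
# so multiplying by 100 gives a percentage
pct_normal = (toy_df["SaleCondition"] == "Normal").mean() * 100

# value_counts(normalize=True) returns the fraction of rows in each
# category; multiplying by 100 converts every fraction to a percentage
pct_by_condition = toy_df["SaleCondition"].value_counts(normalize=True) * 100

print(pct_normal)
print(pct_by_condition)
```

The mean approach is handy for a single condition, while value_counts is convenient when the percentage for every category is wanted.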

What percentage of houses have a price less than $200,000?

# Determine number of houses that cost less than $200,000
num_cost_less_200k = sum(ames_df["SalePrice"] < 200000)

# Calculate the percentage of houses that cost less than $200k.
num_cost_less_200k/total_num*100

What percentage of houses have a sale price between $200,000 and $500,000?

# Build a boolean Series that is True where the price is greater than $200,000 AND less than $500,000
between_200k_and_500k = (ames_df["SalePrice"] > 200000) & (ames_df["SalePrice"] < 500000)

# Determine number of houses that cost between $200,000 and $500,000
num_between_200k_and_500k = sum(between_200k_and_500k)

# Calculate the percentage of houses between $200,000 and $500,000
num_between_200k_and_500k/total_num*100
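The two comparisons joined with & can also be written with the pandas Series.between method. The sketch below uses a made-up toy_prices Series (hypothetical values) and assumes a pandas version where inclusive accepts the string "neither" (1.3 or later). With that setting, both endpoints are excluded, which matches the strict > and < comparisons.

```python
import pandas as pd

# Hypothetical sale prices, including values exactly at both endpoints
toy_prices = pd.Series(
    [150000, 200000, 250000, 499999, 500000, 750000], name="SalePrice"
)

# Two strict comparisons combined with &, as in the lesson
strict_mask = (toy_prices > 200000) & (toy_prices < 500000)

# Same condition with between; inclusive="neither" excludes both endpoints
between_mask = toy_prices.between(200000, 500000, inclusive="neither")

# Percentage of prices strictly between the two bounds
pct_between = between_mask.mean() * 100

print(strict_mask.equals(between_mask), pct_between)
```

Note that between includes both endpoints by default, so leaving out the inclusive argument would also count prices of exactly $200,000 and $500,000.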

Good work! You just learned how to calculate percentages in Python!