import numpy as np
Basic Statistics I: Averages
An average is the central value of a set of numbers.
The arithmetic mean is the sum of the elements divided by the number of elements. For a multidimensional array, it can also be taken along a chosen axis.
# Make an array of rank 1
arr = np.array([1, 2, 3, 4, 5])
Let’s manually calculate the average of array arr.
# Manually calculate the average
average_manual = np.sum(arr)/len(arr)
print(average_manual)
3.0
Now, to make our lives easier, let’s use NumPy’s built-in mean function to calculate the average.
# Calculate the average using NumPy's built-in mean function
average_numpy = np.mean(arr)
print(average_numpy)
3.0
We can also calculate averages on arrays of rank 2.
# Make an array of rank 2
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Determine the mean of the array of rank 2
print(np.mean(a))
6.5
# Determine the mean of each column
averages_columnwise = np.mean(a, axis = 0)
print(averages_columnwise)
[ 5.5 6.5 7.5]
# Determine the mean of each row
averages_rowwise = np.mean(a, axis = 1)
print(averages_rowwise)
[ 2. 5. 8. 11.]
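One way to keep the axis argument straight is to look at shapes: the axis passed to np.mean is the one that gets collapsed. The sketch below reuses the same values as the rank-2 array a; the names col_means, row_means, manual_col_means, and manual_row_means are illustrative and not part of the walkthrough above.

```python
import numpy as np

# Same values as the rank-2 array above: 4 rows, 3 columns
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# axis=0 collapses the rows, leaving one mean per column
col_means = np.mean(a, axis=0)
# axis=1 collapses the columns, leaving one mean per row
row_means = np.mean(a, axis=1)

print(a.shape)          # (4, 3)
print(col_means.shape)  # (3,)
print(row_means.shape)  # (4,)

# The same results by hand: sum along the axis, divide by that axis's length
manual_col_means = a.sum(axis=0) / a.shape[0]
manual_row_means = a.sum(axis=1) / a.shape[1]
print(np.allclose(col_means, manual_col_means))  # True
print(np.allclose(row_means, manual_row_means))  # True
```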
We have been practicing on simulated data, so let’s now work with a real-world dataset: the iris dataset.
# Import the load_iris method
from sklearn.datasets import load_iris
# Import pandas, so that we can work with the data frame version of the iris data
import pandas as pd
# Load the iris data
iris = load_iris()
# Convert the iris data to a data frame format, so that it's easier to view
# and process
iris_df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
iris_df
# Determine the mean of each feature
averages_column = np.mean(iris_df, axis = 0)
print(averages_column)
sepal length (cm) 5.843333
sepal width (cm) 3.054000
petal length (cm) 3.758667
petal width (cm) 1.198667
dtype: float64
So we can determine the averages by row, but should we do this? Why or why not?
# Determine the mean of each row
averages_row = np.mean(iris_df, axis = 1)
print(averages_row)
How should we interpret a value of averages_row? It’s hard to say: each row mixes four different measurements (sepal length, sepal width, petal length, and petal width), so their average does not correspond to any meaningful quantity.
Even though we can calculate any statistic that we want, some statistics may not be interpretable. So be careful with your calculations!
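By contrast, row-wise averages can be meaningful when every column measures the same kind of quantity on the same scale. Here is a minimal sketch with made-up exam scores; the scores data frame and all of its values are hypothetical and used for illustration only.

```python
import pandas as pd

# Hypothetical data: three students, three exams, all scored on the same 0-100 scale
scores = pd.DataFrame(
    {"exam_1": [80, 90, 70], "exam_2": [85, 95, 75], "exam_3": [90, 100, 80]},
    index=["student_a", "student_b", "student_c"],
)

# Row-wise mean: each student's average exam score
student_means = scores.mean(axis=1)
print(student_means)

# Column-wise mean: the class average on each exam
exam_means = scores.mean(axis=0)
print(exam_means)
```

Because every column is an exam score in the same units, both directions have a clear reading here, unlike the row-wise averages of the iris features.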
Great work! You just learned how to take averages in Python! You learned:
- To calculate averages manually and with np.mean.
- To calculate averages by row and by column.
- To calculate averages on a real dataset.
- To know when it is appropriate to calculate row-wise or column-wise averages.