Basic Statistics I: Averages

An average is the central value of a set of numbers.

The arithmetic mean is the sum of the elements divided by the number of elements. For a multidimensional array, it can also be computed along a chosen axis.

import numpy as np
# Make an array of rank 1
arr = np.array([1, 2, 3, 4, 5])

Let’s manually calculate the average of array arr.

# Manually calculate the average
average_manual = np.sum(arr)/len(arr)
print(average_manual)
3.0

Now, to make our lives easier, let’s use NumPy’s mean function to calculate the average.

# Calculate the average using NumPy's mean function
average_numpy = np.mean(arr)
print(average_numpy)
3.0
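
As a quick check, the sketch below confirms that the manual calculation and np.mean agree. It also shows one caveat: len only counts the entries along the first axis, so for arrays of higher rank the element count should come from the size attribute. The array named grid is a hypothetical example, not part of the original walkthrough.

```python
import numpy as np

# Same rank-1 array as in the example above
arr = np.array([1, 2, 3, 4, 5])

# arr.size counts every element, whatever the rank of the array
manual = np.sum(arr) / arr.size
assert np.isclose(manual, np.mean(arr))

# Hypothetical rank-2 array to illustrate the difference between len and size
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(len(grid))                  # 2 (number of rows only)
print(grid.size)                  # 6 (total number of elements)
print(np.sum(grid) / grid.size)   # 3.5
print(np.mean(grid))              # 3.5
```

Dividing by len(grid) here would give 21 / 2 = 10.5, which is not the mean of the six elements.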

We can also calculate averages on arrays of rank 2.

# Make an array of rank 2
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Determine the mean of the array of rank 2
print(np.mean(a))
6.5
# Determine the mean of each column
averages_columnwise = np.mean(a, axis = 0)
print(averages_columnwise)
[ 5.5  6.5  7.5]
# Determine the mean of each row
averages_rowwise = np.mean(a, axis = 1)
print(averages_rowwise)
[  2.   5.   8.  11.]
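
To see what the axis argument does, the following sketch reproduces both results by hand. It assumes the same 4 x 3 array as above; the names n_rows, n_cols, col_means, and row_means are introduced here only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
n_rows, n_cols = a.shape  # 4 rows, 3 columns

# axis=0 collapses the rows: one value per column
col_means = a.sum(axis=0) / n_rows   # 5.5, 6.5, 7.5
# axis=1 collapses the columns: one value per row
row_means = a.sum(axis=1) / n_cols   # 2.0, 5.0, 8.0, 11.0

print(col_means.shape)  # (3,)
print(row_means.shape)  # (4,)

# Both agree with np.mean along the same axis
assert np.allclose(col_means, np.mean(a, axis=0))
assert np.allclose(row_means, np.mean(a, axis=1))
```

A useful way to remember this: the axis you pass is the one that disappears from the shape of the result.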

We have been practicing on simulated data, so let’s now work with a real-world dataset: the iris dataset.

# Import the load_iris function
from sklearn.datasets import load_iris
# Import pandas, so that we can work with the data frame version of the iris data
import pandas as pd
# Load the iris data
iris = load_iris()
# Convert the iris data to a data frame format, so that it's easier to view
# and process
iris_df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
iris_df
# Determine the mean of each feature
averages_column = np.mean(iris_df, axis = 0)
print(averages_column)
sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64

So we can determine the averages by row, but should we do this? Why or why not?

# Determine the mean of each row
averages_row = np.mean(iris_df, axis = 1)
print(averages_row)

How should we interpret a value of averages_row? It’s hard to interpret, because each row mixes four different measurements (sepal length, sepal width, petal length, and petal width), and an average across different features does not correspond to any single meaningful quantity.

Even though we can calculate any statistic that we want, some statistics may not be interpretable. So be careful with your calculations!
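
One way to see the problem is that a row average depends on arbitrary choices, such as the unit of each feature. The sketch below uses a small hypothetical data frame (not the iris data) with two made-up measurements. Rescaling one column from centimeters to millimeters changes which row has the larger average, even though the underlying objects are unchanged.

```python
import pandas as pd

# Hypothetical measurements for two objects, both features in cm
measurements = pd.DataFrame({
    "length (cm)": [5.0, 6.0],
    "width (cm)": [3.0, 2.0],
})

# Row averages with both features in cm: the two rows tie
row_means_cm = measurements.mean(axis=1)
print(list(row_means_cm))     # [4.0, 4.0]

# Express length in mm instead; the objects themselves are the same
rescaled = measurements.rename(columns={"length (cm)": "length (mm)"})
rescaled["length (mm)"] = rescaled["length (mm)"] * 10

# Row averages now differ, purely because of the unit change
row_means_mixed = rescaled.mean(axis=1)
print(list(row_means_mixed))  # [26.5, 31.0]
```

Column averages do not have this problem: rescaling a column rescales its own mean by the same factor and leaves the other columns untouched.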

Great work! You just learned how to take averages in Python! You learned: