Bar Charts and Histograms

Bar Charts

Bar charts display how a continuous variable varies across the levels of a categorical variable. In a bar chart, the categorical variable is shown on the x-axis and the continuous variable on the y-axis.

Categorical variables are variables with different categories or groups.
- Examples: gender, city
Continuous variables are numeric variables that can take any value within a range.
- Examples: time, height, length
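
As a rough illustration of the distinction, the sketch below checks which columns of a small table are numeric. The table and its values are hypothetical stand-ins, not the dataset used in this lesson, and a numeric dtype is only a hint that a column may be continuous.

```python
import pandas as pd

# Tiny hypothetical table (illustrative values only)
passengers = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.28, 7.93],
})

# Text columns are candidates for categorical variables,
# numeric columns are candidates for continuous variables
is_numeric = {
    col: pd.api.types.is_numeric_dtype(passengers[col])
    for col in passengers.columns
}
print(is_numeric)
```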

# import seaborn, matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
# set up inline figures
%matplotlib inline

We will use the Titanic dataset in this example. Let’s load and preview it.

# read in titanic data
titanic = sns.load_dataset("titanic")
# preview data
titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Let’s say we want to compare the mean fare across the three ticket classes for all passengers.

# bar plot of mean fare by class
sns.barplot(x="class", y="fare", data=titanic)

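Notice that we never asked seaborn to compute a mean. By default, barplot aggregates the y values within each category using the mean, and it draws an error bar on each bar showing a 95% confidence interval around that estimate.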
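
To see what the bar heights represent, the same aggregation can be sketched by hand with pandas. The small table below is made up for illustration (it is not the Titanic data), so only the pattern matters, not the numbers.

```python
import pandas as pd

# Hypothetical fares per ticket class (illustrative values only)
fares = pd.DataFrame({
    "class": ["First", "First", "Second", "Second", "Third", "Third", "Third"],
    "fare": [80.0, 60.0, 20.0, 30.0, 8.0, 7.0, 9.0],
})

# One mean per category: these are the values a bar plot of
# fare by class would draw as bar heights
mean_fare = fares.groupby("class")["fare"].mean()
print(mean_fare)
```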

What if we wanted to look at the data more granularly and further stratify each class bar by the sex variable? Based on what you know about seaborn so far, how do you think we can do that?

# bar plot of mean fare by class, stratified by sex
sns.barplot(x="class", y="fare", hue="sex", data=titanic)
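
Adding hue means one bar per combination of class and sex. A minimal pandas sketch of that two-way grouping, again on hypothetical values rather than the real dataset, might look like this:

```python
import pandas as pd

# Hypothetical fares (illustrative values only)
fares = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "fare": [100.0, 60.0, 80.0, 10.0, 14.0, 8.0],
})

# Group by both variables, then pivot sex into columns:
# each cell corresponds to one bar in the stratified plot
mean_by_group = fares.groupby(["class", "sex"])["fare"].mean().unstack("sex")
print(mean_by_group)
```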

Histograms

Histograms are used to visualize the distribution of a continuous variable.

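Let’s say we want to see how age is distributed across all passengers in our dataset. We can use the distplot function to generate our histogram. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot, so recent versions emit a warning when it is called.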

# histogram of age
sns.distplot(titanic['age'].dropna(), kde=False)

We can change the number of bins used to plot our histogram to change the granularity of our distribution plot.

# histogram of age with 10 bins (coarser)
sns.distplot(titanic['age'].dropna(), kde=False, bins=10)

# histogram of age with 80 bins (finer)
sns.distplot(titanic['age'].dropna(), kde=False, bins=80)
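
The effect of the bin count can also be checked numerically. This sketch uses NumPy's histogram function on a few made-up ages (not the real dataset) to show how the same values are split into fewer or more intervals.

```python
import numpy as np

# Hypothetical ages (illustrative only)
ages = np.array([2, 14, 22, 26, 27, 35, 35, 38, 54])

# Two wide bins: a coarse view of the distribution
coarse_counts, coarse_edges = np.histogram(ages, bins=2)

# Four narrower bins: the same data at a finer granularity
fine_counts, fine_edges = np.histogram(ages, bins=4)

print(coarse_edges, coarse_counts)
print(fine_edges, fine_counts)
```

Every value lands in exactly one bin either way, so both sets of counts sum to the number of observations.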

Unfortunately distplot has no hue parameter, so we can’t color a histogram by another variable directly. We can still compare the distribution of a variable between subsets of our DataFrame by layering one histogram per subset on the same axes.

# layered histograms of age for female and male passengers
sns.distplot(titanic.query('sex == "female"')['age'].dropna(), kde=False, label="F")
sns.distplot(titanic.query('sex == "male"')['age'].dropna(), kde=False, label="M")
plt.legend()

Count Plots

Count plots can be thought of as histograms for categorical variables.

Let’s say we wanted to visualize how many passengers there were in each class.

# count plot of class
sns.countplot(x="class", data=titanic)
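
The bar heights in a count plot are simply the number of rows in each category. A minimal pandas sketch of the same tally, on made-up values rather than the real dataset:

```python
import pandas as pd

# Hypothetical ticket classes (illustrative only)
classes = pd.Series(
    ["Third", "First", "Third", "Second", "Third", "First"], name="class"
)

# Number of rows per category, which a count plot draws as bars
class_counts = classes.value_counts()
print(class_counts)
```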

Now, let’s stratify each class by the sex variable using color. By now you’re an expert in this!

# stratify class by sex variable
sns.countplot(x="class", hue="sex", data=titanic)
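
The counts behind the stratified plot form a two-way table. One way to sketch it is with pandas' crosstab, shown here on hypothetical values (not the Titanic data):

```python
import pandas as pd

# Hypothetical passengers (illustrative values only)
passengers = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "sex": ["female", "male", "male", "male", "female"],
})

# Rows are classes, columns are sexes, cells are row counts:
# each cell corresponds to one bar in the stratified count plot
counts = pd.crosstab(passengers["class"], passengers["sex"])
print(counts)
```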

As always we can change the color palette:

# change color palette
sns.countplot(x="class", hue="sex", palette="Set3", data=titanic)
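
To preview which colors a named palette provides before plotting, seaborn's color_palette function can be called directly. This is a small sketch; asking for two colors is an assumption that matches the two levels of sex.

```python
import seaborn as sns

# Ask for the first two colors of the "Set3" palette,
# one per level of the hue variable
palette = sns.color_palette("Set3", n_colors=2)

# Each entry is an (r, g, b) tuple with components between 0 and 1
print(palette)
```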

In this lesson you learned:
- How to create barplots in seaborn
- How to stratify barplots by another variable using color (hue)
- How to create histograms in seaborn
- Changing the granularity of the histograms (bins)
- How to create count plots in seaborn
- How to stratify count plots by another variable using color (hue)