# Import pandas, so that we can import the diabetes dataset and work with the data frame version of this data
import pandas as pd
Lesson 24: Statistical Significance
An introduction to hypothesis testing
Statistical significance is one of the most important parts of statistics: it is what allows us to draw conclusions about our data.
But first, what does statistical significance mean?
- Definition: statistical significance is the likelihood that the difference between a given variation and the baseline is not due to random chance. Statistical significance is calculated using different mathematical formulas, which we’ll see later in the lesson.
We can determine whether a difference is statistically significant by comparing our calculated significance value to the significance level.
- Definition: the statistical significance level is the threshold at which one accepts that an event or difference is statistically significant, that is, not due to random chance. This term is denoted as $\alpha$.
Next, when is statistical significance most practically used?
- It is used in statistical hypothesis testing.
- For example, you want to know whether or not having a healthier diet will result in lower levels of C-Reactive protein, and hence fewer incidents of infection and inflammation.
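What is hypothesis testing?
- Definition: hypothesis testing is the use of a statistic to decide whether the data provide enough evidence to reject a given hypothesis.
- There are two types of statistical hypotheses.
- Definition: The null hypothesis, denoted by $H_0$, is the hypothesis that there is no effect or no difference.
- Definition: The alternative hypothesis, commonly denoted by $H_1$ or $H_a$, is the hypothesis that there is an effect or a difference.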
What is a t-test?
Let’s apply these definitions to our diet and C-Reactive Protein example. We want to know whether or not having a healthier diet will result in lower levels of C-Reactive protein, and hence fewer incidents of infection and inflammation.
- What is the null hypothesis?
- Answer: Having a healthier diet will have zero effect on the levels of C-Reactive protein.
- What is the alternative hypothesis?
- Answer: Having a healthier diet will lower the levels of C-Reactive protein.
Many statistical tests are available, and each is appropriate only if certain assumptions are met. For this lesson, we will focus on the Student’s t-test.
- Here is a link to a cheat sheet of statistical hypothesis tests: https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/.
Let’s start with a Student’s t-test. This test assesses whether samples from two independent populations provide evidence that the populations have different means.
Another aspect of hypothesis testing is the “use of a statistic”. What is this statistic (test statistic)?
- Definition: A test statistic is a quantity derived from the sample that is used in statistical hypothesis testing to determine whether or not the null hypothesis should be rejected.
- There are a wide variety of test statistics to use for a given problem. But as we are using a Student’s t-test, we will focus on the t-statistic.
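As a minimal sketch of where the t-statistic comes from, the block below computes it by hand with the pooled-variance formula and checks the result against `scipy.stats.ttest_ind`. The two samples are synthetic and the helper name `pooled_t_statistic` is hypothetical; neither comes from the lesson's dataset.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(sample_a, sample_b):
    """Student's t-statistic for two independent samples (equal variances assumed)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances (ddof=1)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two sample means
    std_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_error


# Synthetic example data (an assumption, for illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=150, scale=75, size=40)
group_b = rng.normal(loc=160, scale=75, size=50)

manual_t = pooled_t_statistic(group_a, group_b)
scipy_t, _ = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
print("manual t-statistic =", manual_t)
print("scipy t-statistic  =", scipy_t)
```

The two printed values should agree up to floating-point error, since `ttest_ind` uses the pooled-variance form when `equal_var=True`.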
Now that we have calculated our t-statistic, what do we do with it to reject or fail to reject the null hypothesis?
- We need to compare our t-statistic to a critical value.
- Definition: a critical value is a point on the test distribution that is compared to the test statistic to determine whether to fail to reject or reject the null hypothesis.
What is the critical value in the t-distribution when conducting a Student’s t-test?
The absolute value of the critical value depends on the confidence level (A), significance level (P), and the degrees of freedom (DF), which is the sample size - 1 or $n - 1$.
What are the critical value(s) when the significance level is 0.05 and the sample size approaches infinity?
- Answer: 1.96 (that is, -1.96 and +1.96 for a two-tailed test)

If the absolute value of our t-statistic is greater than our critical value, then we can reject the null hypothesis.
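Critical values like the one above can be looked up in a table or computed. The sketch below, assuming a two-tailed test at a significance level of 0.05, uses `scipy.stats.t.ppf` to show how the critical value shrinks toward the standard normal value as the degrees of freedom grow. The particular DF values are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05  # significance level for a two-tailed test

# Two-tailed test: alpha is split evenly between the two tails
for df in (5, 30, 100, 1_000_000):
    critical_value = stats.t.ppf(1 - alpha / 2, df)
    print("DF =", df, "-> critical value = +/-", round(critical_value, 3))

# The standard normal distribution gives the limiting value
normal_limit = stats.norm.ppf(1 - alpha / 2)
print("Normal limit =", round(normal_limit, 3))
```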
Finally, what is a p-value?
- Definition: a p-value is the probability that a test statistic at least as extreme as the one observed would be obtained, assuming that the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis.
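To make the definition concrete, here is a minimal sketch of how a two-sided p-value can be obtained from a t-statistic and its degrees of freedom. The helper name `two_sided_p_value` and the input values are hypothetical, chosen only for illustration.

```python
from scipy import stats


def two_sided_p_value(t_statistic, df):
    """Probability, under the null, of a t-statistic at least as extreme as the observed one."""
    # sf is the survival function (1 - CDF); doubling it covers both tails
    return 2 * stats.t.sf(abs(t_statistic), df)


# Hypothetical values, for illustration only
print("p-value for t = 2.5:", two_sided_p_value(2.5, df=20))
print("p-value for t = 0.5:", two_sided_p_value(0.5, df=20))
```

A t-statistic further from zero gives a smaller p-value, which matches the idea that more extreme results are stronger evidence against the null hypothesis.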
The procedure of hypothesis testing involves four steps:
1. Formulate the null hypothesis and the alternative hypothesis.
2. Identify and compute a test statistic that can be used to reject or fail to reject the null hypothesis. Check your assumptions!
3. Compute the test statistic and p-value.
4. Compare the p-value to an acceptable significance value, $\alpha$. If the p-value is less than or equal to $\alpha$, reject the null hypothesis; otherwise, fail to reject it.
Now, let’s try it ourselves!
We will work with the diabetes dataset to learn how to apply the procedure of hypothesis testing to real-world data.
- This dataset contains 442 diabetes patients with data on age (AGE), sex (SEX), body mass index (BMI), mean arterial blood pressure (MAP), six blood serum measurements (TC, LDL, HDL, TCH, LTG, and GLU), and a quantitative measure of disease progression (Y).
# Set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# This is where the file is located
filename = path + 'SampleData/diabetes.csv'
# Load the diabetes dataset into a DataFrame
diabetes_df = pd.read_csv(filename)
diabetes_df
| | AGE | SEX | BMI | MAP | TC | LDL | HDL | TCH | LTG | GLU | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.00 | 157 | 93.2 | 38.0 | 4.00 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.00 | 183 | 103.2 | 70.0 | 3.00 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.00 | 156 | 93.6 | 41.0 | 4.00 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.00 | 198 | 131.4 | 40.0 | 5.00 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.00 | 192 | 125.4 | 52.0 | 4.00 | 4.2905 | 80 | 135 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 60 | 2 | 28.2 | 112.00 | 185 | 113.8 | 42.0 | 4.00 | 4.9836 | 93 | 178 |
| 438 | 47 | 2 | 24.9 | 75.00 | 225 | 166.0 | 42.0 | 5.00 | 4.4427 | 102 | 104 |
| 439 | 60 | 2 | 24.9 | 99.67 | 162 | 106.6 | 43.0 | 3.77 | 4.1271 | 95 | 132 |
| 440 | 36 | 1 | 30.0 | 95.00 | 201 | 125.2 | 42.0 | 4.79 | 5.1299 | 85 | 220 |
| 441 | 36 | 1 | 19.6 | 71.00 | 250 | 133.2 | 97.0 | 3.00 | 4.5951 | 92 | 57 |
442 rows × 11 columns
We are interested in understanding whether there are differences in diabetes progression by sex, i.e. is the disease progression different for males vs. females?
1. Formulate the null hypothesis and the alternative hypothesis.
- Null hypothesis: There is NO difference in disease progression between males and females.
- Alternative hypothesis: There is a difference in disease progression by sex.
# Import numpy
import numpy as np
# Look at all unique values for sex
np.unique(diabetes_df["SEX"])
array([1, 2])
Males are indicated by “1” for the variable “SEX”, while females are indicated by “2”.
# Define a vector of the disease progression for males and name it progression_male
diabetes_male = diabetes_df.query('SEX == 1')
progression_male = diabetes_male['Y']
# Define a vector of the disease progression for females and name it progression_female
diabetes_female = diabetes_df.query('SEX == 2')
progression_female = diabetes_female['Y']
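Before running a test, it can help to confirm that both groups are non-empty and to glance at their sizes and means. The sketch below uses only the ten rows visible in the table preview above, not the full 442-row dataset, so its numbers are illustrative only. The names `preview_df`, `progression_male_preview`, `progression_female_preview`, and `group_summary` are hypothetical.

```python
import pandas as pd

# The ten rows visible in the table preview (SEX and Y columns only)
preview_df = pd.DataFrame({
    "SEX": [2, 1, 2, 1, 1, 2, 2, 2, 1, 1],
    "Y": [151, 75, 141, 206, 135, 178, 104, 132, 220, 57],
})

# Boolean indexing is an alternative to query() for selecting one group
progression_male_preview = preview_df.loc[preview_df["SEX"] == 1, "Y"]
progression_female_preview = preview_df.loc[preview_df["SEX"] == 2, "Y"]

# Group sizes, means, and standard deviations in one table
group_summary = preview_df.groupby("SEX")["Y"].agg(["count", "mean", "std"])
print(group_summary)
```

The same `groupby` call should work on the full `diabetes_df`, assuming it has been loaded as shown earlier.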
2. Identify and compute a test statistic that can be used to reject or fail to reject the null hypothesis.
- As we are working with two independent samples, we will use the two-sample t-test and use the t-statistic.
3. Compute the test statistic and p-value.
# Import stats methods to help calculate the t-statistic and p-value
from scipy import stats
# Run a Student's t-test with the method ttest_ind
t_statistic, p_value = stats.ttest_ind(progression_male, progression_female)
# Print out the test statistic and p-value
print("t-statistic = " + str(t_statistic))
print("p-value = " + str(p_value))
t-statistic = -0.9041147550244715
p-value = 0.3664292946519826
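The Student's t-test assumes that the two groups have similar variances, which is one of the assumptions worth checking. As a hedged sketch of one common approach (not something this lesson prescribes), Levene's test can be used to check for equal variances, and Welch's t-test (`equal_var=False`) is a widely used alternative when that assumption is doubtful. The samples below are synthetic and deliberately given different spreads.

```python
import numpy as np
from scipy import stats

# Synthetic samples with deliberately different spreads (illustration only)
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=150, scale=40, size=200)
sample_b = rng.normal(loc=150, scale=90, size=60)

# Levene's test: the null hypothesis is that the groups have equal variances
levene_statistic, levene_p_value = stats.levene(sample_a, sample_b)
print("Levene p-value =", levene_p_value)

# Student's t-test (pooled variance) vs. Welch's t-test (no equal-variance assumption)
student_t, student_p = stats.ttest_ind(sample_a, sample_b, equal_var=True)
welch_t, welch_p = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print("Student: t =", student_t, "p =", student_p)
print("Welch:   t =", welch_t, "p =", welch_p)
```

A small Levene p-value suggests the variances differ, in which case the Welch result is usually the safer one to report.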
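4. Compare the p-value to an acceptable significance value, $\alpha$.
- Using the common choice of $\alpha = 0.05$, our p-value of about 0.366 is greater than $\alpha$, so we fail to reject the null hypothesis. These data do not show a statistically significant difference in disease progression between males and females.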
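The comparison in step 4 can be sketched in code using the t-statistic and p-value printed above. The significance level of 0.05 is an assumed choice, and the degrees of freedom follow the usual convention for a two-sample Student's t-test (total sample size minus 2). The critical-value check is included to show that both routes lead to the same decision.

```python
from scipy import stats

# Values printed above by the lesson's t-test
t_statistic = -0.9041147550244715
p_value = 0.3664292946519826

alpha = 0.05   # assumed significance level
df = 442 - 2   # total sample size minus 2 for a two-sample Student's t-test

# Decision using the p-value
reject_by_p_value = p_value <= alpha

# Equivalent decision using the two-tailed critical value
critical_value = stats.t.ppf(1 - alpha / 2, df)
reject_by_critical_value = abs(t_statistic) > critical_value

if reject_by_p_value:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```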
Congratulations! You know how to conduct hypothesis testing with a Student’s t-test!
Misconceptions about statistical significance:
1. A low p-value implies a large effect.
   - Proper interpretation: A low p-value indicates that the outcome would be highly unlikely if the null hypothesis were true. A low p-value does not necessarily mean a large effect; a low p-value can occur with a small effect.
2. A non-significant outcome (AKA high p-value) means that the null hypothesis is probably true.
   - Proper interpretation: A non-significant outcome (AKA high p-value) means that the data do not conclusively demonstrate that the null hypothesis is false. This is why we should say, “When the p-value > 0.05, we fail to reject the null hypothesis.” We should not say that we accept the null hypothesis when the p-value > 0.05.
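The first misconception can be illustrated with a small simulation. In the sketch below, two synthetic groups differ by only a tiny amount on average, but the samples are very large, so the p-value comes out small anyway. Cohen's d, a common effect-size measure that this lesson does not otherwise cover, is computed alongside to show how small the effect is. All numbers are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: a tiny true difference in means, but very large samples
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=10.0, size=200_000)
group_b = rng.normal(loc=100.2, scale=10.0, size=200_000)

t_statistic, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: difference in means divided by the pooled standard deviation
# (with equal group sizes, the pooled variance is the average of the two variances)
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std

print("p-value   =", p_value)
print("Cohen's d =", cohens_d)
```

The p-value falls below 0.05 even though the groups differ by only a small fraction of a standard deviation, which is why a p-value should be read together with an effect size.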
Awesome work! You just learned about statistical significance! You learned:
- Important definitions such as statistical significance, statistical significance level, null hypothesis, alternative hypothesis, test statistic, p-value, and critical values.
- To conduct hypothesis testing.
- To determine critical values to compare with your own test statistic in order to decide whether a variable has an effect on the outcome of interest.
- To implement a Student’s t-test.