Lesson 24: Statistical Significance

An introduction to hypothesis testing

Statistical significance is one of the most important parts of statistics: it is what allows us to draw conclusions from our data. But first, what does statistical significance mean?
- Definition: a result is statistically significant when the difference between a given variation and the baseline is unlikely to have arisen from random chance alone. Statistical significance is assessed using different mathematical formulas, which we’ll see later in the lesson. We determine whether a difference is statistically significant by comparing a calculated value (the p-value, defined later in the lesson) to the significance level.
- Definition: the statistical significance level is the threshold at which we decide whether an event or difference is statistically significant, i.e. unlikely to be due to random chance. This term is denoted as $α$. The customary significance level is 5%.
- Definition: the confidence level is the complement of the significance level. This term is calculated as $1 - α$. The customary confidence level is 95%.
- For example, if you run an A/B testing experiment with a significance level ($α$) of 5%, then when there is truly no difference between the variations, there is a 5% chance that randomness alone leads you to declare a winner anyway, and a 95% chance ($1 - α$) that you correctly do not declare one.
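To make the 5% figure concrete, here is a minimal simulation sketch. It assumes made-up data: both groups are drawn from the same normal distribution, so any "significant" result is a false positive. It uses the two-sample t-test from `scipy`, which is introduced later in this lesson. The group sizes, seed, and number of experiments are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
alpha = 0.05
n_experiments = 2000  # hypothetical number of repeated A/B experiments
n_per_group = 50      # hypothetical sample size per group

# Both groups come from the same distribution, so there is no real difference
baseline = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_group))
variation = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_group))

# One t-test per row (per simulated experiment)
_, p_values = stats.ttest_ind(baseline, variation, axis=1)

# Share of experiments that were declared significant purely by chance
false_positive_rate = np.mean(p_values < alpha)
print("share of experiments declared significant:", false_positive_rate)
```

The printed share should land close to the significance level, which is what $α$ controls: how often we are fooled when nothing is going on.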

Next, when is statistical significance most practically used?
- It is used in statistical hypothesis testing.
- For example, you want to know whether or not having a healthier diet will result in lower levels of C-Reactive protein, and hence fewer incidents of infection and inflammation.

What is hypothesis testing?
- Definition: hypothesis testing is the use of a statistic to assess whether the data provide enough evidence against a given hypothesis.
- There are two types of statistical hypotheses.
- Definition: The null hypothesis, denoted by $H_{0}$, is usually the hypothesis that sample observations result purely from chance. The most common null hypothesis is that the effect of the variable in question is equal to 0, i.e. the variable has zero effect on the outcome of interest.
- Definition: The alternative hypothesis, denoted by $H_{1}$ or $H_{a}$, is the hypothesis that sample observations are influenced by some non-random cause. A common alternative hypothesis is that the variable in question has a non-zero effect on the outcome.

What is a t-test?

Let’s apply these definitions to our diet and C-Reactive protein example. We want to know whether or not having a healthier diet will result in lower levels of C-Reactive protein, and hence fewer incidents of infection and inflammation.

What is the null hypothesis?
- Answer: Having a healthier diet will have zero effect on the levels of C-Reactive protein.

What is the alternative hypothesis?
- Answer: Having a healthier diet will lower the levels of C-Reactive protein.

There are many statistical tests, each appropriate when certain assumptions are met. For this lesson, we will focus on the Student’s t-test.
- Here is a link to a cheat sheet of statistical hypothesis tests in Python: https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/.

Let’s start with a Student’s t-test. This test assesses whether samples from two independent populations provide evidence that the populations have different means.

Another aspect of hypothesis testing is the “use of a statistic”. What is this statistic (test statistic)?
- Definition: A test statistic is a quantity derived from the sample and is used in statistical hypothesis testing to determine whether or not the null hypothesis should be rejected.
- There are a wide variety of test statistics to use for a given problem. But as we are using a Student’s t-test, we will focus on the t-statistic.

$$T = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{s_{1}^{2} / N_{1} + s_{2}^{2} / N_{2}}}$$

where $N_{1}$ and $N_{2}$ are the sample sizes, $\bar{x}_{1}$ and $\bar{x}_{2}$ are the sample means, and $s_{1}^{2}$ and $s_{2}^{2}$ are the sample variances.
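The formula can be checked by hand with a short sketch. The two arrays below are made-up numbers, not data from this lesson. Note one assumption: the denominator written above does not pool the two variances, which corresponds to `equal_var=False` in `scipy.stats.ttest_ind`, so that is the option used for the comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups (made-up numbers)
group_1 = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 4.7])
group_2 = np.array([4.2, 5.0, 4.4, 4.9, 4.1, 4.6, 4.3])

n_1, n_2 = len(group_1), len(group_2)
mean_1, mean_2 = group_1.mean(), group_2.mean()
# ddof=1 gives the sample variance (divides by N - 1)
var_1, var_2 = group_1.var(ddof=1), group_2.var(ddof=1)

# The t-statistic exactly as written in the formula
t_by_hand = (mean_1 - mean_2) / np.sqrt(var_1 / n_1 + var_2 / n_2)

# scipy uses the same unpooled denominator when equal_var=False
t_scipy, _ = stats.ttest_ind(group_1, group_2, equal_var=False)

print("t by hand:   ", t_by_hand)
print("t from scipy:", t_scipy)
```

With the default `equal_var=True`, `ttest_ind` pools the two variances instead. The two versions give the same t-statistic when both groups have the same size.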

Now that we have calculated our t-statistic, what do we do with it to reject or fail to reject the null hypothesis?
- We need to compare our t-statistic to a critical value.
- Definition: a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject or fail to reject the null hypothesis.

What is the critical value in the t-distribution when conducting a Student’s t-test?

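The absolute value of the critical value depends on the significance level (equivalently, the confidence level), on whether the test is one-sided or two-sided, and on the degrees of freedom (DF). For a single sample, DF is the sample size minus 1, or $N - 1$. For the two-sample Student’s t-test used in this lesson, DF is $N_{1} + N_{2} - 2$.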

What are the critical value(s) when the significance level is 0.05 and the sample size approaches infinity?
- Answer: $\pm 1.96$ for a two-sided test.

If the absolute value of our t-statistic is greater than the critical value, then we can reject the null hypothesis.
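Critical values can be looked up in a t-table, or computed. The sketch below uses `scipy.stats.t.ppf` (the inverse of the cumulative distribution function) for a two-sided test at $α = 0.05$. The degrees-of-freedom values in the loop are arbitrary examples.

```python
from scipy import stats

alpha = 0.05

# Two-sided test: put alpha / 2 in each tail of the t-distribution
for df in [5, 30, 100, 10000]:
    critical_value = stats.t.ppf(1 - alpha / 2, df)
    print("df =", df, "critical value =", round(critical_value, 3))

# The standard normal distribution is the limit as df grows without bound
normal_critical_value = stats.norm.ppf(1 - alpha / 2)
print("normal limit =", round(normal_critical_value, 3))
```

Small samples need a larger critical value, and the values shrink toward 1.96 as the degrees of freedom grow.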

Finally, what is a p-value?
- Definition: a p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
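As a rough sketch of how a p-value follows from a t-statistic: for a two-sided test, it is the area in both tails of the t-distribution beyond the observed value. The t-statistic and degrees of freedom below are hypothetical numbers chosen only for illustration.

```python
from scipy import stats

# Hypothetical values for illustration only
t_statistic = 2.1
degrees_of_freedom = 20

# sf is the survival function, 1 - cdf: the area in the upper tail.
# Doubling it counts values at least this far from 0 in either direction.
p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
print("two-sided p-value =", p_value)
```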

The procedure of hypothesis testing involves four steps:
1. Formulate the null hypothesis and the alternative hypothesis.
2. Identify a test statistic that can be used to reject or fail to reject the null hypothesis. Check your assumptions!
3. Compute the test statistic and p-value.
4. Compare the p-value to an acceptable significance level, $α$, and compare the test statistic to the acceptable critical value(s). If p-value $\leq α$ and |test statistic| $\geq$ critical value, then the observed effect is statistically significant, the null hypothesis is rejected, and the data support the alternative hypothesis.
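The four steps can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the two samples are randomly generated, and the names `sample_a` and `sample_b` are hypothetical. The sketch also shows that, for a two-sided test, the p-value comparison and the critical-value comparison lead to the same decision.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical samples from two independent groups
sample_a = rng.normal(loc=10.0, scale=2.0, size=40)
sample_b = rng.normal(loc=11.0, scale=2.0, size=40)

alpha = 0.05

# Step 1: H0: the two population means are equal. Ha: they differ.
# Step 2: two independent samples, so we use the two-sample t-statistic.
# Step 3: compute the test statistic and the p-value.
t_statistic, p_value = stats.ttest_ind(sample_a, sample_b)

# Step 4: compare to the significance level and to the critical value.
df = len(sample_a) + len(sample_b) - 2
critical_value = stats.t.ppf(1 - alpha / 2, df)

reject_by_p_value = p_value <= alpha
reject_by_critical_value = abs(t_statistic) >= critical_value

print("t-statistic =", t_statistic, "p-value =", p_value)
print("reject H0 (p-value rule):       ", reject_by_p_value)
print("reject H0 (critical-value rule):", reject_by_critical_value)
```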

Now, let’s try it ourselves!

We will work with the diabetes dataset to learn how to apply the procedure of hypothesis testing to real-world data.
- This dataset contains 442 diabetes patients with data on age (AGE), sex (SEX), body mass index (BMI), mean arterial blood pressure (MAP), six blood serum measurements (TC, LDL, HDL, TCH, LTG, and GLU), and a quantitative measure of disease progression (Y).

# Import pandas, so that we can import the diabetes dataset and work with the data frame version of this data
import pandas as pd

# Set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/curriculum-notebooks/master/'
# This is where the file is located
filename = path + 'SampleData/diabetes.csv'

# Load the diabetes dataset into a DataFrame
diabetes_df = pd.read_csv(filename)
diabetes_df

	AGE	SEX	BMI	MAP	TC	LDL	HDL	TCH	LTG	GLU	Y
0	59	2	32.1	101.00	157	93.2	38.0	4.00	4.8598	87	151
1	48	1	21.6	87.00	183	103.2	70.0	3.00	3.8918	69	75
2	72	2	30.5	93.00	156	93.6	41.0	4.00	4.6728	85	141
3	24	1	25.3	84.00	198	131.4	40.0	5.00	4.8903	89	206
4	50	1	23.0	101.00	192	125.4	52.0	4.00	4.2905	80	135
...	...	...	...	...	...	...	...	...	...	...	...
437	60	2	28.2	112.00	185	113.8	42.0	4.00	4.9836	93	178
438	47	2	24.9	75.00	225	166.0	42.0	5.00	4.4427	102	104
439	60	2	24.9	99.67	162	106.6	43.0	3.77	4.1271	95	132
440	36	1	30.0	95.00	201	125.2	42.0	4.79	5.1299	85	220
441	36	1	19.6	71.00	250	133.2	97.0	3.00	4.5951	92	57

442 rows × 11 columns

We are interested in understanding whether there are differences in diabetes progression by sex, i.e. is the disease progression different for males vs. females?

1. Formulate the null hypothesis and the alternative hypothesis.
- Null hypothesis: There is NO difference in disease progression between males and females.
- Alternative hypothesis: There is a difference in disease progression by sex.

# Import numpy 
import numpy as np

# Look at all unique values for sex
np.unique(diabetes_df["SEX"])

array([1, 2])

Males are indicated by “1” for the variable “SEX”, while females are indicated by “2”.

# Define a vector of the disease progression for males and name it progression_male
diabetes_male = diabetes_df.query('SEX == 1')
progression_male = diabetes_male['Y']

# Define a vector of the disease progression for females and name it progression_female
diabetes_female = diabetes_df.query('SEX == 2')
progression_female = diabetes_female['Y']

2. Identify a test statistic that can be used to reject or fail to reject the null hypothesis.
- As we are working with two independent samples, we will use the two-sample t-test and its t-statistic.

3. Compute the test statistic and p-value.

# Import stats methods to help calculate the t-statistic and p-value
from scipy import stats

# Run a Student's t-test with the method ttest_ind 
t_statistic, p_value = stats.ttest_ind(progression_male, progression_female)

# Print out the test statistic and p-value
print("t-statistic = " + str(t_statistic))
print("p-value = " + str(p_value))

t-statistic = -0.9041147550244715
p-value = 0.3664292946519826

4. Compare the p-value to an acceptable significance level, $α$, and compare the test statistic to the acceptable critical value(s). If p-value $\leq α$ and the test statistic $\geq$ +critical value or the test statistic $\leq$ -critical value, then the observed effect is statistically significant, the null hypothesis is rejected, and the data support the alternative hypothesis.
- p-value $= 0.37 > 0.05$, so we fail to reject the null hypothesis.
- t-statistic $= -0.90 > -1.96$, so it does not fall below the negative critical value (and it is also below $+1.96$). This reaffirms that we fail to reject the null hypothesis.
- Interpretation: We did not find a statistically significant difference in diabetes progression between males and females.
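The comparison in step 4 can also be reproduced from the printed t-statistic alone. This sketch assumes the degrees of freedom are $N_{1} + N_{2} - 2 = 440$, which holds if all 442 rows have a value for Y and the default pooled test was used. It then computes the critical value for that DF instead of the large-sample value of 1.96.

```python
from scipy import stats

# t-statistic printed by ttest_ind above
t_statistic = -0.9041147550244715
# Assumption: pooled two-sample test on all 442 patients, so DF = 442 - 2
degrees_of_freedom = 442 - 2
alpha = 0.05

# Two-sided critical value and p-value from the t-distribution
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

reject_null = abs(t_statistic) >= critical_value

print("critical value =", critical_value)
print("p-value =", p_value)
print("reject the null hypothesis:", reject_null)
```

The critical value for 440 degrees of freedom is slightly larger than 1.96, and the decision is unchanged.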

Congratulations! You know how to conduct hypothesis testing with a Student’s t-test!

Misconceptions about statistical significance:
1. A low p-value implies a large effect.
- Proper interpretation: A low p-value indicates that the outcome would be highly unlikely if the null hypothesis were true. A low p-value does not necessarily mean a large effect. With a large enough sample, a low p-value can occur even when the effect is small.
2. A non-significant outcome (AKA high p-value) means that the null hypothesis is probably true.
- Proper interpretation: A non-significant outcome (AKA high p-value) means that the data do not conclusively demonstrate that the null hypothesis is false. This is why we should say, “When the p-value > 0.05, we fail to reject the null hypothesis.” We should not say that we accept the null hypothesis when the p-value > 0.05.
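The first misconception can be illustrated with a small simulation, sketched here on made-up data. The two groups differ by a tiny amount (0.03 standard deviations), but each group is very large. Cohen's d is used as one common effect-size measure. It is not covered in this lesson and is included only to put a number on "small effect".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible
n_per_group = 200_000           # hypothetical, deliberately very large

# The true means differ by only 0.03 standard deviations
group_a = rng.normal(loc=0.00, scale=1.0, size=n_per_group)
group_b = rng.normal(loc=0.03, scale=1.0, size=n_per_group)

_, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: difference in means divided by the pooled standard deviation
# (this simple average of variances is valid because the groups are equal in size)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print("p-value =", p_value)
print("Cohen's d =", cohens_d)
```

The p-value is far below 0.05 even though the effect size is tiny, so statistical significance on its own says little about how large or important a difference is.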

Awesome work! You just learned about statistical significance! You learned:

- Important definitions such as statistical significance, statistical significance level, null hypothesis, alternative hypothesis, test statistic, p-value, and critical values.
- How to conduct hypothesis testing.
- How to determine critical values to compare with your own test statistic in order to decide whether a variable has an effect on the outcome of interest.
- How to implement a Student’s t-test.