# What Does Statistically Significant Mean?

Statistical significance is one of the most important concepts in statistics. It is widely used in all sorts of scientific publications and is a fundamental building block for many common statistical tests, such as ANOVA. People often associate statistical significance with the ‘p-value’ and pair it with thresholds such as 0.1 or 0.05. Applying statistical significance on paper is pretty straightforward, since it boils down to comparing numbers. However, I still found the concept hard to comprehend. For example, what does it really mean when something is said to be statistically significant? This question bugged me for the longest time, and I hope this post can help aspiring data science practitioners.

# Hypothesis Testing

Before I start, it’s important to understand Hypothesis Testing, which I very briefly mentioned in my post about Confidence Interval. Why? Because the ultimate goal of statistical significance is to evaluate the hypothesis we raised in the first place. More formally, hypothesis testing is the procedure for translating information from sample data into a measure of evidence for a claim about the population. For instance, a valid claim could be *“The mean positive rate of COVID-19 in the United States is 2%”*.

To make this claim testable, it can be broken down into two components:

- **Null Hypothesis (H₀)**: an assumption of equality (or inequality) with respect to the population parameter (=, ≤, ≥).
- **Alternative Hypothesis (Hₐ or H₁)**: the complement of the Null Hypothesis (≠, >, <, respectively).

Therefore, the COVID-19 example above could be written as:

**H₀**: µ = 2%
**H₁**: µ ≠ 2%

This type of test is referred to as a **two-sided** hypothesis test, because population means that are either smaller or larger than 2% fall under the alternative hypothesis. On the other hand, if the null hypothesis stated that the mean positive rate µ ≤ 2%, it would be a **one-sided** hypothesis test, since the alternative hypothesis only covers the larger side.

The next step is to check whether the sample data supports the claim in the null hypothesis. To do this, we can leverage the power of the *Central Limit Theorem*. More concretely, we can predict the distribution of sample means assuming that the claim in the null hypothesis is true; this distribution is known as the **null distribution**. This is very similar to what we do with the Confidence Interval. We then determine the probability of observing the actual sample mean under the null distribution. If the observed sample mean is ‘likely’ to occur, the sample data aligns with the claim in the null hypothesis. Otherwise, the sample data does not provide enough evidence to support the null hypothesis.

There are two takeaways:

- The null hypothesis is always assumed to be **TRUE** during a hypothesis test.
- Insufficient evidence to support the null hypothesis generally means we reject the null hypothesis.

Going back to our example: if we assume that the sample standard deviation is 1% and the sample size is 50, the standard error (the standard deviation of the sample mean) is σ / √n ≈ s / √n. The sample standard deviation (s) is used here to estimate the population standard deviation (σ), since the latter is unknown. We can then say that under the null hypothesis, the sample mean is expected to be normally distributed with a mean of 2% and a standard deviation of roughly 0.14%, as illustrated in the figure below:
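
As an aside, this standard error can be reproduced with a short sketch (Python is used here purely for illustration; s = 1% and n = 50 are taken from the example):

```python
import math

s = 1.0  # sample standard deviation, in percentage points (from the example)
n = 50   # sample size (from the example)

# Standard error: the estimated standard deviation of the sample mean, s / sqrt(n)
standard_error = s / math.sqrt(n)
print(round(standard_error, 2))  # 0.14
```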

Using this distribution, an observed sample mean of 1.9% is likely to occur, since it’s pretty close to the center. However, an observed sample mean of 2.4% is highly improbable, since it falls far into the upper tail of this distribution, which does not agree with the null hypothesis. To better evaluate our hypothesis, we can use a quantity called the **p-value**.


# p-value

Formally, the p-value is defined as *the probability of observing data as extreme or more extreme than the one observed, assuming the null hypothesis is true*.

The definition of the p-value is quite bewildering; at least, it took me quite a while to process it. To better understand it, we can divide it into two parts and look at them separately:

- Looking at the second half first: this is the point I stressed earlier. We assume the null hypothesis is true and look for evidence backed by the sample data to reject it. In drug testing, we start with a null hypothesis stating that taking the drug produces no effect (or no negative effect) on certain health conditions. The next step is to conduct clinical trials and determine whether the sample data is inconsistent with the claim in our null hypothesis. For example, if 90% of the patients show symptomatic improvement under the assumption that the drug is not helpful, we might want to reject the null hypothesis. (This process is obviously much more complex in real life.)
- *“As extreme or more extreme than the one observed”* means sample statistics that are further than the observed value in the direction of the alternative hypothesis. In other words, these are one or both tails of the normal distribution, depending on the alternative hypothesis. More specifically, if we are dealing with a one-sided test and the null hypothesis states that the parameter is smaller than or equal to a certain value, we need to look at the section to the right of the observed value. If the test is two-sided, we need to consider sample statistics in both tails.

Using the COVID-19 example, we are testing H₁: µ ≠ 2%. Observed sample means that are much smaller than or much larger than 2% would suggest that the mean positive rate is not 2%. Therefore, sample means that are as extreme or more extreme than an observed mean of 2.4% are those that are 2.4% or higher OR 1.6% or lower (2.4% is 0.4% higher than 2%, so the lower cutoff for extreme is 2%−0.4% = 1.6%).

Once that is sorted out, we can calculate the p-value using statistical software such as R.
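
A minimal sketch of this computation in Python (the original calculation used R; the null mean of 2%, observed mean of 2.4%, s = 1%, and n = 50 are taken from the example above):

```python
import math

def normal_cdf(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu0 = 2.0        # mean under the null hypothesis (%)
x_bar = 2.4      # observed sample mean (%)
s, n = 1.0, 50   # sample standard deviation (%) and sample size

se = s / math.sqrt(n)    # standard error
z = (x_bar - mu0) / se   # standardized test statistic

# Two-sided p-value: total probability in both tails beyond |z|
p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(p_value, 4))  # 0.0047
```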

Based on the calculation, the p-value is 0.0047. This means that, assuming the null hypothesis is true, the probability of observing a sample mean as extreme or more extreme than the observed 2.4% is 0.0047, which is highly unlikely. In other words, we just observed something very improbable in a world where the null hypothesis is true.

# Statistical Significance

How do we know if an estimated p-value is small enough? The typical thresholds are 0.01, 0.05 and 0.1. These are referred to as the **significance level (ɑ)**.

If the p-value of a hypothesis test is less than or equal to ɑ, it means that we observed something that is unlikely to occur under the null hypothesis. Otherwise, if the p-value is greater than ɑ, we conclude that the data is not inconsistent with the claim in the null hypothesis.

In summary, if the p-value ≤ α (the significance level), we reject the null hypothesis and conclude there is a statistically significant result. Otherwise, we fail to reject the null hypothesis.
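
The decision rule itself is just a numeric comparison; a minimal sketch, with the p-value and α taken from the COVID-19 example:

```python
p_value = 0.0047  # from the COVID-19 example above
alpha = 0.01      # chosen significance level

# Reject H0 when the p-value is at or below the significance level
if p_value <= alpha:
    print("Reject the null hypothesis: the result is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")
```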

Now we are back to the original question: what does it mean when something is said to be statistically significant? With our COVID-19 example, since the p-value is less than 0.01, the evidence that the mean positive rate in the United States is not 2% is statistically significant at level 0.01. In plain English, under the null hypothesis the average positive rate should be 2%, so sample means are expected to be close to 2% as well. However, because we observed a sample mean of 2.4%, which should be extremely rare, it’s reasonable to conclude that the null hypothesis is inaccurate.

One thing to note: even when the null hypothesis is rejected, *we CANNOT claim that the alternative hypothesis is accepted, nor that the hypothesis test proves the claim in the alternative hypothesis is correct*. Sample data is subject to sampling variability, and there is no guarantee that the test correctly determines the population parameter.

So what happens when the null hypothesis is rejected? The next step is often to determine an estimated interval for the average positive rate, by calculating a **95% confidence interval** around the observed sample mean.
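
Assuming the standard large-sample interval x̄ ± z₀.₉₇₅ · s/√n (the post’s original formula is not reproduced here), this can be sketched in Python with the example’s numbers:

```python
import math

x_bar = 2.4      # observed sample mean (%)
s, n = 1.0, 50   # sample standard deviation (%) and sample size
z = 1.96         # ~97.5th percentile of the standard normal, for a 95% interval

se = s / math.sqrt(n)  # standard error
lower, upper = x_bar - z * se, x_bar + z * se
print(f"95% CI: ({lower:.2f}%, {upper:.2f}%)")  # 95% CI: (2.12%, 2.68%)
```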

To learn more about confidence interval, check out my other post: A Detailed Look at Confidence Interval

# Summary

- Hypothesis Testing involves making two hypotheses that are complements of each other.
- Null hypothesis is assumed to be True during the test.
- The p-value is the probability of observing data as rare/extreme or more rare/extreme than the data actually observed.
- When p-value is less than or equal to a pre-defined significance level, we can reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.
- The evidence supporting the alternative hypothesis is statistically significant at a given significance level (e.g. 0.05) if the null hypothesis is rejected at that level.