Hypothesis Testing

This chapter draws on material from:

Changes to the source material include light editing, adding new material, deleting original material, rearranging material, and adding first-person language from the current author.

Introduction

One job of a data scientist is to make statistical inferences about populations based on samples taken from the population. As we've previously discussed, confidence intervals are one way to estimate a population parameter.

Another way to make a statistical inference is to make a decision about a parameter. For instance, a car dealer advertises that its new small truck gets 35 miles per gallon, on average. A tutoring service claims that its method of tutoring helps 90% of its students get an A or a B. A company says that women managers in their company earn an average of $60,000 per year.

A data scientist can make a decision about these claims. This process is called "hypothesis testing." A hypothesis test involves collecting data from a sample and evaluating the data. Then, the statistician makes a decision as to whether or not there is sufficient evidence, based upon analyses of the data, to reject the null hypothesis.

In this chapter, you will conduct hypothesis tests on single means and single proportions. You will also learn about the errors associated with these tests.

Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the data, and a conclusion. To perform a hypothesis test, a statistician will:

  1. Set up two contradictory hypotheses.
  2. Collect sample data (in homework problems, the data or summary statistics will be given to you).
  3. Determine the correct distribution to perform the hypothesis test.
  4. Analyze sample data by performing the calculations that ultimately will allow you to reject or decline to reject the null hypothesis.
  5. Make a decision and write a meaningful conclusion.
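
If it helps to see what these steps look like in practice, here is a minimal sketch in Python (using SciPy). It works through the truck-mileage claim from the introduction; the sample values, the one-sided alternative, and the choice of a one-sample t-test are all assumptions made purely for illustration.

```python
# A minimal sketch of the five steps, using the truck-mileage claim from the
# introduction.  The sample values below are made up for illustration only.
from scipy import stats

# Step 1: set up the hypotheses.
#   H0: the mean mileage is 35 mpg (mu = 35)
#   Ha: the mean mileage is less than 35 mpg (mu < 35)

# Step 2: collect sample data (hypothetical measurements from 10 trucks).
mpg = [33.1, 34.8, 34.2, 35.5, 33.9, 34.0, 32.7, 34.6, 33.5, 34.3]

# Step 3: with a small sample and an unknown population standard deviation,
# a one-sample t-test is a reasonable choice of distribution.

# Step 4: analyze the sample data.  alternative="less" matches Ha: mu < 35.
t_stat, p_value = stats.ttest_1samp(mpg, popmean=35, alternative="less")

# Step 5: make a decision and state a conclusion at the alpha = 0.05 level.
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0; the data suggest the mean is below 35 mpg.")
else:
    print(f"p = {p_value:.4f}: do not reject H0; not enough evidence against 35 mpg.")
```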

Null and Alternative Hypotheses

The actual test begins by considering two hypotheses. They are called the null hypothesis and the alternative hypothesis. These hypotheses contain opposing viewpoints.

H0: The null hypothesis. The null hypothesis is a statement of no difference, no effect, or no relationship. It can often be considered the status quo: what is already assumed or how things already stand. As a result, it takes a solid argument to reject the null hypothesis, because as things stand, there is no reason to believe that there is a difference between groups, an effect of an intervention or treatment, or a relationship between variables.

Ha: The alternative hypothesis. It is a claim about the population that contradicts H0. It is usually what a researcher is trying to demonstrate statistically, but doing so requires rejecting H0 first.

Since the null and alternative hypotheses are contradictory, you must examine the evidence to decide whether it is sufficient to reject the null hypothesis (and thereby accept the alternative hypothesis). The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision: "reject H0" if the sample information favors the alternative hypothesis, or "do not reject H0" (also phrased "decline to reject H0") if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H0 and Ha:

H0                              Ha
equal (=)                       not equal (≠), greater than (>), or less than (<)
greater than or equal to (≥)    less than (<)
less than or equal to (≤)       greater than (>)

Outcomes of Hypothesis Testing

When you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in the following table:

Your Decision         When H0 is True      When H0 is False
Do not reject H0      Correct outcome      Type II error
Reject H0             Type I error         Correct outcome

The four possible outcomes in the table are:

  1. The decision is not to reject H0 when H0 is true (correct decision).
  2. The decision is to reject H0 when H0 is true (incorrect decision known as a Type I error).
  3. The decision is not to reject H0 when, in fact, H0 is false (incorrect decision known as a Type II error).
  4. The decision is to reject H0 when H0 is false (correct decision whose probability is called the Power of the Test).

Each of the errors occurs with a particular probability. The Greek letters α and β represent the probabilities.

α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true.

β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false.

α and β should be as small as possible because they are probabilities of errors. They are rarely zero.

The Power of the Test is 1 – β. Ideally, we want a high power that is as close to one as possible. Increasing the sample size can increase the Power of the Test.
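
To make the relationship between sample size and power a little more concrete, here is a small simulation sketch in Python. The "true" mean, the null value, and the sample sizes are made-up assumptions; the point is simply that, when H0 is false, larger samples reject it more often.

```python
# A small simulation sketch of power: the probability of correctly rejecting H0
# when H0 is false.  All of the numbers here are made-up assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
true_mean, null_mean, sd = 36.0, 35.0, 2.0   # H0 (mu = 35) is actually false here

def estimated_power(n, reps=5000):
    """Estimate the power of a one-sample t-test for samples of size n."""
    rejections = 0
    for _ in range(reps):
        sample = rng.normal(true_mean, sd, size=n)
        _, p = stats.ttest_1samp(sample, popmean=null_mean)
        if p < alpha:
            rejections += 1
    return rejections / reps

for n in (10, 30, 100):
    print(f"n = {n:3d}: estimated power is roughly {estimated_power(n):.2f}")
# Larger samples reject the false H0 more often: power (1 - beta) increases with n.
```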

Rare Events and Rethinking Assumptions

Suppose you make an assumption about a property of the population (this assumption is the null hypothesis). Then you gather sample data randomly. If the sample has properties that would be very unlikely to occur if the assumption is true, then you would conclude that your assumption about the population is probably incorrect. (Remember that your assumption is just an assumption—it is not a fact and it may or may not be true. But your sample data are real and the data are showing you a fact that seems to contradict your assumption.)

For example, Didi and Ali are at a birthday party of a very wealthy friend. They hurry to be first in line to grab a prize from a tall basket that they cannot see inside because they will be blindfolded. There are 200 plastic bubbles in the basket and Didi and Ali have been told that there is only one with a $100 bill. Didi is the first person to reach into the basket and pull out a bubble. Her bubble contains a $100 bill. The probability of this happening is 0.005 (or 0.5%).

Because this is so unlikely, this event provides an invitation for Didi and Ali to rethink their assumptions about how this game works. In particular, Ali is hoping that what the two of them were told is wrong, and there are actually several $100 bills in the basket. This would help explain why Didi got a $100 bill with her first draw, even though this is such an unlikely event. That said, it's not a guarantee that assumptions are wrong! Unlikely things still happen sometimes, so Ali might be out of luck here. However, because a "rare event" has occurred (Didi getting the $100 bill), Ali doubts the assumption about only one $100 bill being in the basket.
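
The 0.005 figure is just the basic probability of drawing the one winning bubble out of 200 on the first try. As a quick check, here is a tiny sketch in Python that computes it directly and confirms it with a simulation (the simulation is purely illustrative).

```python
# The probability that Didi's first draw is the one $100 bubble out of 200.
p_first_draw = 1 / 200
print(p_first_draw)                       # 0.005, i.e. 0.5%

# Optional sanity check by simulation: label bubble 0 as the winner and see how
# often a random first draw picks it.
import random
trials = 100_000
wins = sum(random.randrange(200) == 0 for _ in range(trials))
print(wins / trials)                      # should land close to 0.005
```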

p-values and Hypotheses

We use similar considerations about rare events to decide whether or not to reject a null hypothesis (that is, a standing assumption about how the world works). To do this, we have to get back to probability.

In particular, we use the sample data to calculate the probability of getting results like ours if our previous assumption (the null hypothesis) is true. This probability is called the p-value. The p-value is the probability that, if the null hypothesis is true, another randomly selected sample would give results as extreme as or more extreme than the results obtained from the given sample. That is, the higher the p-value, the less surprising our results would be if the null hypothesis were true; the lower the p-value, the more surprising our results would be under that assumption.

A large p-value calculated from the data indicates that we should not reject the null hypothesis. Our assumptions about the world seem pretty solid—results like ours are quite plausible if the null hypothesis is true. In contrast, the smaller the p-value, the more unlikely the outcome under the null hypothesis, and the stronger the evidence against it. We would reject the null hypothesis if the evidence is strongly against it.

A systematic way to decide whether to reject or not reject the null hypothesis is to compare the p-value to a preset or preconceived α (the probability of a Type I error; this preset value is also called the "significance level"). To reiterate, a preset α is the probability of a Type I error (rejecting the null hypothesis when the null hypothesis is actually true).

When you make a decision to reject or not reject H0, do as follows:

  • If α > p-value, reject H0. The results of the sample data are significant. There is sufficient evidence to conclude that H0 is an incorrect belief and that the alternative hypothesis, Ha, may be correct.
  • If α ≤ p-value, do not reject H0. The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis, Ha, may be correct. (A short code sketch of this decision rule follows this list.)
  • When you "do not reject H0," it does not mean that you should believe that H0 is true. It simply means that the sample data have failed to provide sufficient evidence to cast serious doubt on the truthfulness of H0.

Traditionally, a significance level (α) is set at 0.05. That is, data scientists (and other researchers) want to make sure that there is less than a 5% chance that they would mistakenly reject standing assumptions about the world (in the form of a null hypothesis) in favor of a new understanding (in the form of an alternative hypothesis). However, remember that the danger of a Type I error is always there, just as it's always possible that Didi simply had amazing luck in drawing the $100 bill first! In this class, we'll use the traditional significance levels in hypothesis testing.

Statistical (and Practical) Significance

p-values are a critical statistical tool for telling us whether to reject or fail to reject the null hypothesis (that is, our standing assumptions about the world). However, another way of thinking about what p-values do is that they tell us whether a perceived difference, effect, or relationship matters—is that difference, effect, or relationship significant? Remember that inferential statistics is all about using samples to draw conclusions about populations. We would always prefer to have all the data from a population, but that's rarely practical, so we settle for a sample instead. However, we don't have all the data that we want, and there's a certain amount of randomness in a sample—we have to acknowledge the possibility that luck was "on our side" and that the differences, effects, or relationships we think we see aren't actually there.

For example, let's imagine that I'm teaching two LIS/ICT 661 classes simultaneously, and I want to use that opportunity to test the effectiveness of some of my teaching methods. In one class, I jazz up my introduction videos for each week—I write detailed scripts, add swelling music, and even fiddle around with some special effects. In the other class, I keep my videos the straightforward affairs that they are in this class. Everything else stays the same. At the end of the semester, I find that the average grade of the students in the fancy video class is slightly higher than the average grade of the students in the regular video class—say, an 89% average compared to an 87% average. There is a difference between those two averages... but it's not much of a difference, and it's hard to say whether it actually means something. Did my videos actually help? Or did students in the fancy videos class have more prior experience with statistics? Or was there an unusual number of family emergencies in the regular videos class that distracted students from data science?

For reasons like this, data scientists use p-values to determine whether a difference (or an effect or a relationship) is actually significant—or whether it could be due to plain old chance. In the case of this example, I would use hypothesis testing to lend some clarity to my findings. If my p-value was below 0.05, I would describe my results as statistically significant—that is, I'd conclude that my fancy videos actually did have an effect on students' performance in this class. In contrast, if my p-value was at or above 0.05, I'd conclude that my results are not statistically significant—that is, there's no reason to believe that my fancy videos made any difference.
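
To show roughly what that hypothesis test might look like, here is a sketch in Python. The final grades are invented (chosen so the class averages land near 89% and 87%), and the choice of a two-sample t-test is an assumption for illustration.

```python
# A sketch of the fancy-videos comparison, assuming a two-sample t-test and
# invented final grades whose averages land near 89% and 87%.
from scipy import stats

fancy_videos   = [92, 88, 95, 85, 90, 87, 91, 89, 86, 88]   # hypothetical grades
regular_videos = [89, 84, 90, 83, 88, 85, 87, 86, 90, 88]   # hypothetical grades

t_stat, p_value = stats.ttest_ind(fancy_videos, regular_videos)
alpha = 0.05

if p_value < alpha:
    print(f"p = {p_value:.3f}: the difference between classes is statistically significant.")
else:
    print(f"p = {p_value:.3f}: the difference between classes is not statistically significant.")
```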

However, it's also important to note that data scientists (and other researchers) are starting to put less emphasis on the p-value as a measure of true significance. This is for two reasons:

First, as we hinted at early in the semester, larger data sets make it easier to get a smaller p-value—and it's easier to come by larger data sets these days. All else being equal, a larger data set makes it more likely to get "statistically significant" findings, which means that we're starting to see more and more statistical significance in findings. This challenges the actual value of statistical significance and pushes data scientists to evaluate their findings more holistically.
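
A quick simulation sketch can make this concrete. In the (made-up) scenario below, the two groups differ by the same tiny amount in both runs; the only thing that changes is the sample size, and with it the p-value.

```python
# A sketch of how sample size alone can drive "statistical significance".
# Both runs compare groups whose true means differ by the same tiny amount;
# all of the numbers here are made-up assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (50, 20_000):
    group_a = rng.normal(70.0, 10.0, size=n)   # true mean 70.0
    group_b = rng.normal(70.5, 10.0, size=n)   # true mean 70.5
    _, p = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:6d} per group: p = {p:.4f}")
# With 50 per group the 0.5-point difference usually looks like noise; with
# 20,000 per group the same difference almost always comes out "significant".
```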

Second, and on a related note, statistical significance does not necessarily translate into practical significance. Let's imagine that a pharmaceutical company develops a treatment that is proven to have a statistically significant effect in increasing cancer patients' life expectancy. Let's also imagine for the purposes of this hypothetical example that we're quite sure this isn't a result of dumb luck or even a large sample size. That sounds great, right? Let's fund more research, ramp up production, and get this treatment into hospitals! However, let's imagine that the company's data scientists take another look at the data and come to the conclusion that this treatment only adds about five minutes to life expectancy. (This is an intentionally ridiculous example—I'm not sure life expectancy could be measured at this level!). In short, this effect is statistically significant but practically insignificant—the company is sure that the treatment works (in that it has an effect in the first place) but not sure that the treatment matters (in that the effect is so small). Even the most selfless pharmaceutical company would be forgiven for not pouring more money into further developing such a treatment.