3 Asking the Right Questions
The discipline of statistics is about turning data into information in order to address some question. While there may be no such thing as a stupid question, there are ill-posed questions — those which cannot be answered as stated. Consider the Deepwater Horizon Case Study (Chapter 2). It might seem natural to ask “if a volunteer cleans wildlife, will she develop adverse respiratory symptoms?” Let’s consider the data. Of the 54 volunteers assigned to wildlife cleaning and rehabilitation, 15 reported experiencing adverse respiratory symptoms (“nose irritation, sinus problems, or sore throat”); while some volunteers developed symptoms, others did not. It seems the answer to our question is then “it depends” or “maybe.” This is an example of an ill-posed question. Such questions exist because of variability, the fact that every subject in the population does not behave in exactly the same way; that is, the value of a variable potentially differs from one observation to the next. In our example not every volunteer had the same reaction when directly exposed to oil.
It is variability that creates a need for the discipline of statistics; in fact, you could think of statistics as the study and characterization of variability. We must therefore learn to ask the right questions — those which can be answered in the presence of variability.
Definition 3.1 (Variability) The notion that measurements differ from one observation to another.
The presence of variability makes some questions ill-posed; statistics concerns itself with how to address questions in the presence of variability.
3.1 Characterizing a Variable
Recall that the goal of statistical inference is to say something about the population; as a result, any question we ask should then be about on this larger group. The first step to constructing a well-posed question is then to identify the population of interest for the study. For the Deepwater Horizon Case Study, it is unlikely that we are only interested in these 54 observed volunteers assigned to wildlife cleaning. In reality, we probably want to say something about volunteers for any oil spill. The 54 volunteers in our dataset form the sample, a subset from all volunteers who clean wildlife following an oil spill. Our population of interest is comprised of all volunteers who clean wildlife following an oil spill.
When identifying the population of interest for a research question you have, be specific! Suppose you are trying to estimate the average height of trees. Are you really interested in all trees? Or, are you interested in Maple trees within the city limits of Terre Haute, Indiana?
Since we expect that the reaction to oil exposure — the primary variable of interest for this study, sometimes called the response — to vary from one individual to another, we cannot ask a question about the value of the reaction (whether they experienced symptoms or not). Instead, we want to characterize the distribution of the response.
Definition 3.2 (Response) The primary variable of interest within a study. This is the variable you would either like to explain or estimate.
Definition 3.3 (Distribution) The pattern of variability corresponding to a set of values.
Notice that in this case, the response is a categorical variable; describing the distribution of such a variable is equivalent to describing how individuals are divided among the possible groups. With a finite number of observations, we could present the number of observations, the frequency, within each group. For example, of the 54 volunteers, 15 experienced adverse symptoms and 39 did not. This works well within the sample; however, as our population is infinitely large (all volunteers cleaning wildlife following an oil spill), reporting the frequencies is not appropriate. In this case, we report the fraction of observations, the relative frequency, falling within each group; this helps convey information about the distribution of this variable. That is, the relative frequencies give us a sense of which values of the variable are more or less common in the sample.
Definition 3.4 (Frequency) The number of observations in a sample falling into a particular group (level) defined by a categorical variable.
Definition 3.5 (Relative Frequency) Also called the “proportion,” the fraction of observations falling into a particular group (level) of a categorical variable.
Numeric quantities, like the proportion, which summarize the distribution of a variable within the population are known as parameters.
Definition 3.6 (Parameter) Numeric quantity which summarizes the distribution of a variable within the population of interest. Generally denoted by Greek letters in statistical formulas.
While the value of a variable may vary across the population, the parameter is a single fixed constant which summarizes the variable for that population. For example, the grade received on an exam varies from one student to another in a class; but, the average exam grade is a fixed number which summarizes the class as a whole. Well-posed questions can be constructed if we limit ourselves to questions about the parameter. The second step in constructing well-posed questions is then to identify the parameter of interest.
The questions we ask generally fall into one of two categories:
- Estimation: what proportion of volunteers who clean wildlife following an oil spill will experience adverse respiratory symptoms?
- Hypothesis Testing: is it reasonable no more than 1 in 5 volunteers who clean wildlife following an oil spill will experience adverse respiratory symptoms; or, is there evidence more than 1 in 5 volunteers who clean wildlife following an oil spill will experience adverse respiratory symptoms?
Definition 3.7 (Estimation) Using the sample to approximate the value of a parameter from the underlying population.
Definition 3.8 (Hypothesis Testing) Using a sample to determine if the data is consistent with a working theory or if there is evidence to suggest the data is not consistent with the theory.
Since we do not get to observe the population (we only see the sample), we cannot observe the value of the parameter. That is, we will never know the true proportion of volunteers who experience symptoms. However, we can determine what the data suggests about the population (that is what inference is all about).
Parameters are unknown values and can never, in general, be known.
It turns out, the vast majority of research questions can be framed in terms of a parameter. This is the first of what we consider the Five Fundamental Ideas of Inference.
A research question can often be framed in terms of a parameter that characterizes the population. Framing the question should then guide our analysis.
We now have a way of describing a well-posed question, a question which can be addressed using data. Well posed questions are about the population and can be framed in terms of a parameter which summarizes that population. We now describe how these questions are typically framed.
3.2 Framing the Question
In engineering and scientific applications, many questions fall under the second category of hypothesis testing, which is a form of model comparison in which data is collected to help the researcher choose between two competing theories for the parameter of interest. In this section, we consider the terminology surrounding specifying such questions.
For the Deepwater Horizon Case Study suppose we are interested in addressing the following question:
Is there evidence that more than 1 in 5 volunteers who clean wildlife following an oil spill will develop adverse respiratory symptoms?
The question itself is about the population (all volunteers assigned to clean wildlife following an oil spill) and is centered on a parameter (the proportion who develop adverse respiratory symptoms). That is, this is a well-posed question that can be answered with appropriate data. The overall process for addressing these types of questions is similar to conducting a trial in a court of law. In the United States, a trial has the following essential steps:
- Assume the defendant is innocent.
- Present evidence to establish guilt, to the contrary of innocence (prosecution’s responsibility).
- Consider the weight of the evidence presented (jury’s responsibility).
- Make a decision. If the evidence is “beyond a reasonable doubt,” the jury declares the defendant guilty; otherwise, the jury declares the defendant not guilty.
The process of conducting a hypothesis test has similar essential steps:
- Assume the opposite of what we want the data to show (develop a working theory).
- Gather data and compare it to the proposed model from step (1).
- Quantify the likelihood of our data from step (2) under the proposed model.
- If the likelihood is small, conclude the data is not consistent with the working model (there is evidence for what we want to show); otherwise, conclude the data is consistent with the working model (there is no evidence for what we want to show).
Notice that a trial focuses not on proving guilt but on disproving innocence; similarly, in statistics, we are able to establish evidence against a specified theory. This is one of several subtle points in hypothesis testing. We will discuss these subtleties at various points throughout the text and revisit the overall concepts often. Here, we focus solely on that first step — developing a working theory that we want to disprove.
This process may seem counter-intuitive; it is natural to ask “why can’t we prove guilt directly?” However, when you disprove one statement, you are proving that statement’s opposite — a technique known in mathematics as “proof by contradiction.” So, our approach to proving a statement is to disprove all other possibilities. It is similar to the technique of the fictional detective Sherlock Holmes (Doyle 1890, pg. 92): “Eliminate all other factors, and the one which remains must be the truth.”
Consider the above question for the Deepwater Horizon Case Study. We want to find evidence that the proportion experiencing adverse symptoms exceeds 0.20 (1 in 5). Therefore, we would like to disprove (or provide evidence against) the statement that the proportion experiencing adverse symptoms is no more than 0.20. This statement that we would like to disprove is known as the null hypothesis; the opposite of this statement, called the alternative hypothesis, captures what we as the researchers would like to establish.
Definition 3.9 (Null Hypothesis) The statement (or theory) about the parameter that we would like to disprove. This is denoted \(H_0\), read “H-naught” or “H-zero”.
Definition 3.10 (Alternative Hypothesis) The statement (or theory) about the parameter capturing what we would like to provide evidence for; this is the opposite of the null hypothesis. This is denoted \(H_1\) or \(H_a\), read “H-one” and “H-A” respectively.
For the Deepwater Horizon Case Study, we write:
\(H_0:\) The proportion of volunteers assigned to clean wildlife following an oil spill who experience adverse respiratory symptoms is no more than 0.20.
\(H_1:\) The proportion of volunteers assigned to clean wildlife following an oil spill who experience adverse respiratory symptoms exceeds 0.20.
Each hypothesis is a well-posed statement (about a parameter characterizing the entire population), and the two statements are exactly opposite of one another meaning only one can be a true statement.
When framing your questions, be sure your null hypothesis and alternative hypothesis are exact opposites of one another, and ensure the “equality” component always goes in the null hypothesis.
We can now collect data and determine if it is consistent with the null hypothesis (a statement similar to “not guilty”) or if the data provides evidence against the null hypothesis and in favor of the alternative (a statement similar to “guilty”).
The term “consistent” and “reasonable” will be used interchangeably throughout the text; however, these terms differ substantially from the term “evidence.” The data is said to be consistent with a statement if the data is aligned with that statement. We have evidence for a statement when the data is aligned with that statement and it is not aligned with the opposite of the statement (when we can disprove something). We will see this develop more in Chapters 6 and 7.
Often these statements are written in a bit more of a mathematical structure in which a Greek letter is used to represent the parameter of interest. For example, we might write
Let \(\theta\) represent the proportion of volunteers (assigned to clean wildlife following an oil spill) who experience adverse respiratory symptoms.
\(H_0: \theta \leq 0.20\)
\(H_1: \theta > 0.20\)
In the above statements, \(\theta\) represents the parameter of interest; the value 0.20 is known as the null value.
Definition 3.11 (Null Value) The value associated with the equality component of the null hypothesis; it forms the threshold or boundary between the hypotheses. Note: not all questions of interest require a null value be specified.
Hypothesis testing is a form of statistical inference in which we quantify the evidence against a working theory (captured by the null hypothesis). We essentially argue that the data supports the alternative if it is not consistent with the working theory.
This section has focused on developing the null and alternative hypothesis when our question of interest is best characterized as one of comparing models or evaluating a particular statement. If our goal is estimation, a null and alternative hypothesis are not applicable. For example, we might have the following goal:
Estimate the proportion of volunteers (assigned to clean wildlife following an oil spill) who experience adverse respiratory symptoms.
In this version of our research “question” there is no statement which needs to be evaluated. We are interested in estimation, not hypothesis testing and thus there is no corresponding null and alternative hypothesis.
In order to frame a research question, consider the following steps:
- Identify the population of interest.
- Identify the parameter(s) of interest.
- Determine if you are interested in estimating the parameter(s) or quantifying the evidence against some working theory.
- If you are interested in testing a working theory, make the null hypothesis the working theory and the alternative hypothesis the exact opposite statement (capturing what you want to provide evidence for).