28 Conditions on the Error Term of an ANOVA Model
In the previous chapter we developed a general model for the data generating process of a quantitative response as a function of a categorical predictor:
\[(\text{Response})_i = \sum_{j=1}^{k} \mu_j (\text{Group } j)_{i} + \varepsilon_i,\]
where
\[(\text{Group } j)_i = \begin{cases} 1 & \text{if the } i\text{-th observation is from Group } j \\ 0 & \text{otherwise} \end{cases}\]
are indicator variables to capture the level of the factor to which each observation belongs.
We also discussed a common method for estimating the parameters of this model from a sample — the method of least squares. However, if we are to construct a model for the sampling distribution of these estimates we must add some structure to the stochastic component \(\varepsilon\) in the model. In this chapter, we focus on the most common conditions we might impose and how those conditions impact the model for the sampling and null distributions (and therefore the computation of a confidence interval or p-value).
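To make this recap concrete, here is a minimal sketch (in Python, with entirely hypothetical data and group labels) that builds the indicator variables and verifies that the least squares estimates are simply the sample means of the groups:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a categorical predictor with k = 3 levels.
group = np.repeat(np.array(["A", "B", "C"]), [10, 12, 11])
truth = {"A": 5.0, "B": 5.5, "C": 6.0}
response = np.array([truth[g] for g in group]) + rng.normal(0, 0.5, size=group.size)

# Build the k indicator variables (one column per level of the factor).
levels = np.unique(group)
X = (group[:, None] == levels[None, :]).astype(float)

# Least squares estimates of mu_1, ..., mu_k.
mu_hat, *_ = np.linalg.lstsq(X, response, rcond=None)

# The least squares solution coincides with the sample mean of each group.
for level, est in zip(levels, mu_hat):
    print(level, round(est, 3), round(response[group == level].mean(), 3))
```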
28.1 Independent Errors
The first condition we consider is that the noise attributed to one observed individual is independent of the noise attributed to any other individual observed. That is, the amount of error in any one individual’s response is unrelated to the error in any other response observed. This is the same condition we introduced in Chapter 10 and Chapter 18.
The independence condition states that the error in one observation is independent (see Definition 10.3) of the error in all other observations.
With just this condition, we can use a bootstrap algorithm in order to model the sampling distribution of the least squares estimates of our parameters (see Appendix A). However, additional conditions are often considered.
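As a sketch of what such an algorithm could look like (an illustration, not the specific implementation in Appendix A), the function below resamples observations with replacement within each group, relying only on the independence condition, and records the least squares estimates (the group means) from each resample; `response` and `group` are assumed to be NumPy arrays of the responses and group labels:

```python
import numpy as np

def bootstrap_group_means(response, group, n_boot=5000, seed=42):
    """Empirical model for the sampling distribution of the least squares
    estimates: resample observations with replacement within each group."""
    rng = np.random.default_rng(seed)
    levels = np.unique(group)
    boot = np.empty((n_boot, levels.size))
    for b in range(n_boot):
        for j, level in enumerate(levels):
            obs = response[group == level]
            boot[b, j] = rng.choice(obs, size=obs.size, replace=True).mean()
    return levels, boot

# Percentile confidence intervals for each group mean:
# levels, boot = bootstrap_group_means(response, group)
# np.percentile(boot, [2.5, 97.5], axis=0)
```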
28.2 Same Degree of Precision
The second condition that is typically placed on the distribution of the errors is that the variability of the errors is the same in each group. We introduced this condition in Chapter 18. Practically, this condition implies that the variability of the response within each group is the same for all groups.
Also called homoskedasticity, the constant variance condition states that the variability of the errors within each group is the same across all groups.
With this additional condition imposed, we are able to modify our bootstrap algorithm when constructing a model for the sampling distribution of the least squares estimates.
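One possible modification is sketched below: because the constant variance condition says the errors in every group share the same spread, the residuals can be pooled across all groups and resampled as a single error population (a residual bootstrap). As before, this is an illustration with hypothetical names rather than a prescribed implementation:

```python
import numpy as np

def residual_bootstrap_group_means(response, group, n_boot=5000, seed=42):
    """Residual bootstrap: with constant variance imposed, the residuals
    from all groups are pooled and resampled as one error population."""
    rng = np.random.default_rng(seed)
    levels = np.unique(group)
    group_means = {g: response[group == g].mean() for g in levels}
    fitted = np.array([group_means[g] for g in group])
    residuals = response - fitted  # pooled estimates of the errors
    boot = np.empty((n_boot, levels.size))
    for b in range(n_boot):
        # Generate a new sample consistent with the fitted model by adding
        # resampled (pooled) errors back onto the fitted group means.
        new_response = fitted + rng.choice(residuals, size=residuals.size, replace=True)
        boot[b] = [new_response[group == g].mean() for g in levels]
    return levels, boot
```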
At this point, you might ask “if we assume constant variance in the classical ANOVA model and in the classical regression model, why did we not consider this assumption in the single response case (Chapter 10)?” Remember, the condition is not that the errors are the same; the condition is that the variability of the errors within each group is the same for every group. In Chapter 10, our model for the data generating process did not have a predictor; therefore, there was only a single group to which the response belonged. The variability in the response cannot vary across the predictor if there is no predictor; so, this condition could not be violated in that setting!
28.3 Specific Form of the Error Distribution
The third condition that is typically placed on the distribution of the errors is that the errors follow a Normal distribution, as discussed in Chapter 18. Here, we are assuming a particular structure on the distribution of the error population.
The normality condition states that the distribution of the errors follows the functional form of a Normal distribution (Definition 18.2).
Let’s think about what this condition means for the responses. Given the shape of the Normal distribution, imposing this condition (in addition to the other conditions) implies that some errors are positive and some are negative. This in turn implies that some responses will be above average for their group, and some responses will be below average for their group. Moreover, because the distribution of the response within each group is just a shifted version of the distribution of the errors, we know that the distribution of the response variable itself follows a Normal distribution.
With this last condition imposed, we can construct an analytic model for the sampling distribution of the least squares estimates. As in regression modeling, we are not required to impose all three conditions in order to obtain a model for the sampling distribution of the estimates. Historically, however, all three conditions have been routinely imposed in the scientific and engineering literature.
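To see what “analytic” buys us, note that under all three conditions the sampling distribution of each least squares estimate has a known closed form. Writing \(n_j\) for the number of observations in group \(j\) and \(\sigma^2\) for the common variance of the errors (notation introduced here only for illustration), the estimate \(\widehat{\mu}_j\) is the sample mean of the responses in group \(j\), and

\[\widehat{\mu}_j \sim N\!\left(\mu_j,\ \frac{\sigma^2}{n_j}\right).\]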
28.4 Classical ANOVA Model
We have discussed three conditions we could place on the stochastic portion of the data generating process. Placing all three conditions on the error term is what we refer to as the “Classical ANOVA Model.”
Definition 28.1 (Classical ANOVA Model) For a quantitative response and single categorical predictor with \(k\) levels, the classical ANOVA model assumes the following data generating process:
\[(\text{Response})_i = \sum_{j=1}^{k} \mu_j (\text{Group } j)_i + \varepsilon_i\]
where
\[ (\text{Group } j)_{i} = \begin{cases} 1 & \text{if the } i\text{-th observation belongs to group } j \\ 0 & \text{otherwise} \end{cases} \]
are indicator variables and where
- The error in the response for one subject is independent of the error in the response for all other subjects.
- The variability in the error of the response within each group is the same across all groups.
- The errors follow a Normal Distribution.
This is the default “ANOVA” analysis implemented in the majority of statistical packages.
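As one illustration (using Python's statsmodels package; the data frame and column names below are hypothetical), fitting the model in its cell-means form might look like the following:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per subject, with the response
# ("score") and the categorical predictor ("exposure").
df = pd.DataFrame({
    "score":    [5.2, 5.9, 5.5, 5.7, 5.6, 6.1],
    "exposure": ["comfort", "comfort", "control",
                 "control", "organic", "organic"],
})

# "- 1" removes the intercept so that each coefficient is a group mean,
# matching the cell-means parameterization used in this chapter.
fit = smf.ols("score ~ C(exposure) - 1", data=df).fit()
print(fit.summary())  # estimates, standard errors, and 95% CIs
```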
A “hidden” (typically unstated but should not be ignored) condition is that the sample is representative of the underlying population. In the one sample case (Chapter 10), we referred to this as the errors being “identically distributed.” We no longer use the “identically distributed” language for technical reasons; however, we still require that the sample be representative of the underlying population.
We note that “ANOVA” need not require all three conditions imposed in Definition 28.1. Placing all three conditions on the error term results in a specific analytical model for the sampling distribution of the least squares estimates. Changing the conditions changes the way we model the sampling distribution.
The model for the sampling distribution of a statistic is determined by the conditions you place on the stochastic portion of the model for the data generating process.
At this point, you might be wondering what happened to the “mean-0” condition we imposed in regression models. The mean-0 condition stated that the error was 0, on average, for all values of the predictor. Recall that this assumption implied that the model for the mean response was correctly specified — that no curvature was ignored. In our model above, with only a single categorical predictor with \(k\) levels (captured through \(k\) indicator variables), there is no “trend” being described. That is, instead of saying that the mean response increases, decreases, or has any particular form, the deterministic portion of the model allows the mean response in one group to be completely unrelated to the mean response in any other group. Since there is no “trend” term in the mean response model, we cannot have misspecified that trend. The mean-0 condition is only required when you make a simplifying assumption constraining the mean response to follow a specific functional form.
For the same reason, we did not need to consider a mean-0 condition in Unit I, where we focused on inference about a single mean: with no predictor in the model, there is no trend to misspecify.
28.5 Imposing the Conditions
Let’s return to our model for the moral expectation score as a function of the food exposure group given in Equation 27.1:
\[(\text{Moral Expectations})_i = \mu_1 (\text{Comfort})_i + \mu_2 (\text{Control})_i + \mu_3 (\text{Organic})_i + \varepsilon_i,\]
where we use the same indicator variables defined in Chapter 27. We were interested in the following research question:
Does the average moral expectation score differ for at least one of the three food exposure groups?
This was captured by the following hypotheses:
\(H_0: \mu_1 = \mu_2 = \mu_3\)
\(H_1: \text{at least one } \mu_j \text{ differs}.\)
Using the method of least squares, we constructed point estimates of the parameters in the model; this leads to the following estimates of the average moral expectation score for each exposure group:

Exposure Group | Estimated Mean Moral Expectation Score
---|---
Comfort Foods | 5.58
Control Foods | 5.65
Organic Foods | 5.75
If we are willing to assume the data is consistent with the conditions for the classical ANOVA model, we are able to model the sampling distribution of these estimates and therefore construct confidence intervals. Table 28.2 summarizes the results of fitting the model described in Equation 27.1 using the data available from the Organic Foods Case Study. In addition to the least squares estimates, it also contains the standard error (see Definition 6.4) of each statistic, quantifying the variability in the estimates. Finally, there is a 95% confidence interval for each parameter.
Term | Estimate | Standard Error | Lower 95% CI | Upper 95% CI
---|---|---|---|---
Comfort Foods Group | 5.585 | 0.128 | 5.331 | 5.839
Control Foods Group | 5.654 | 0.130 | 5.397 | 5.912
Organic Foods Group | 5.750 | 0.131 | 5.490 | 6.010
Chapter 6 described, in general, how confidence intervals are constructed. Under the classical ANOVA model, the sampling distribution of each estimate has a known analytical form. As a result, the confidence interval can be computed from a formula.
If the classical ANOVA model is assumed, the 95% confidence interval for the parameter \(\mu_j\) can be approximated by
\[\widehat{\mu}_j \pm (1.96) \left(\text{standard error of } \widehat{\mu}_j\right)\]
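As a quick arithmetic check against Table 28.2, take the Comfort Foods group:

\[5.585 \pm (1.96)(0.128) = 5.585 \pm 0.251 = (5.334,\ 5.836),\]

which is close to, but not exactly, the reported interval of (5.331, 5.839); the small difference arises because statistical software uses the exact quantile of the modeled sampling distribution rather than the approximation 1.96.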
The confidence intervals for the mean moral expectation score within each group were constructed assuming the classical ANOVA model. And, while the confidence intervals are similar for each of the groups, we have not actually addressed the question of interest. We cannot use the confidence intervals given to directly compare the groups. Instead, we must directly attack the hypotheses of interest by computing a p-value. We consider this in the next chapter.
It is common to try to compare the mean response of several groups by determining whether the confidence intervals for the mean response of each group overlap. This is a mistake. If you want to compare groups, you need to conduct an analysis that directly addresses the hypothesis of interest.
28.6 Recap
We have covered a lot of ground in this chapter, and it is worth taking a moment to summarize the big ideas. In order to compare the mean response in each group, we took a step back and modeled the data generating process. Such a model consists of two components: a deterministic component explaining the response as a function of the predictor, and a stochastic component capturing the noise in the system.
Certain conditions are placed on the distribution of the noise in our model. With a full set of conditions (classical ANOVA model), we are able to model the sampling distribution of the least squares estimates analytically. We can also construct an empirical model for the sampling distribution of the least squares estimates assuming the data is consistent with fewer conditions.
In general, the more conditions we are willing to impose on the data generating process, the more tractable the analysis. Most important, however, is that the data come from a process consistent with the conditions we impose; assessing whether that is the case is discussed in Chapter 30.