Appendix B — Mathematical Results for Standardized Statistics
Here, we present some mathematical results associated with standardized statistics. The material provided in this appendix is not necessary for understanding the concepts presented in the text. It is provided only for those who are curious about the mathematics underlying some of the claims made throughout.
B.1 Single Mean Response, Same Signal
In Chapter 12, we argued that the level of background noise can make it difficult to detect a signal. That is illustrated in the following theorem.
Theorem B.1 Consider two samples of the same size \(n\). Suppose the two samples have the same sample mean, and consider testing the hypotheses
\[H_0: \mu = \mu_0 \qquad \text{vs.} \qquad H_1: \mu \neq \mu_0\]
with each sample. The difference in the sum of squares
\[SS_0 - SS_1 = \sum_{i=1}^{n} \left(y_i - \mu_0\right)^2 - \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2\]
will be the same for both samples.
Proof. Observe that the difference in the sums of squares can be written as
\[ \begin{aligned} SS_0 - SS_1 &= \sum_{i=1}^{n} \left(y_i - \mu_0\right)^2 - \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \\ &= \sum_{i=1}^{n} \left[\left(y_i - \mu_0\right)^2 - \left(y_i - \bar{y}\right)^2\right] \\ &= \sum_{i=1}^{n} \left[y_i^2 - 2y_i \mu_0 + \mu_0^2 - y_i^2 + 2y_i \bar{y} - \bar{y}^2\right] \\ &= \sum_{i=1}^{n} \left[\mu_0^2 - 2y_i \left(\mu_0 - \bar{y}\right) - \bar{y}^2\right] \\ &= \sum_{i=1}^{n} \mu_0^2 - 2\left(\mu_0 - \bar{y}\right)\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y}^2 \\ &= n\mu_0^2 - 2n\mu_0 \bar{y} + 2n\bar{y}^2 - n\bar{y}^2 \\ &= n\mu_0^2 - 2n\mu_0 \bar{y} + n\bar{y}^2 \\ &= n \left(\mu_0 - \bar{y}\right)^2. \end{aligned} \]
That is, the difference in the sums of squares is a function only of the null value and the sample mean. Therefore, if two samples of the same size have the same sample mean, then the difference in the sums of squares will be the same for the two samples.
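As a quick numerical illustration of Theorem B.1, the following Python sketch (with made-up sample values and a null value of \(\mu_0 = 10\), all chosen purely for illustration) constructs two samples of the same size that share a sample mean but differ greatly in spread, and confirms that \(SS_0 - SS_1\) agrees for both.

```python
import numpy as np

def ss_difference(y, mu0):
    """Compute SS_0 - SS_1 for a sample y and null value mu0."""
    ss0 = np.sum((y - mu0) ** 2)          # sum of squares using the null value
    ss1 = np.sum((y - y.mean()) ** 2)     # sum of squares using the sample mean
    return ss0 - ss1

mu0 = 10.0                                        # arbitrary null value
low_noise = np.array([11.9, 12.0, 12.1, 12.0])    # sample mean 12, small spread
high_noise = np.array([6.0, 18.0, 9.0, 15.0])     # sample mean 12, large spread

print(ss_difference(low_noise, mu0))    # 16.0, which is n * (ybar - mu0)^2 = 4 * 4
print(ss_difference(high_noise, mu0))   # 16.0 as well, despite the extra noise
```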
B.2 Single Mean Response, Two Standardized Statistics
In Chapter 12, we presented the standardized statistic
\[T_1^* = \frac{SS_0 - SS_1}{s^2}.\]
However, another popular standardized statistic for testing hypotheses for a single mean is
\[T_2^* = \frac{\sqrt{n} \left(\bar{y} - \mu_0\right)}{s}.\]
As the following result shows, these two standardized statistics are closely related.
Theorem B.2 Consider a sample of size \(n\) collected to test the following hypotheses:
\[H_0: \mu = \mu_0 \qquad \text{vs.} \qquad H_1: \mu \neq \mu_0.\]
Define
\[T_1^* = \frac{SS_0 - SS_1}{s^2},\]
where
\[ \begin{aligned} SS_0 &= \sum_{i=1}^{n} \left(y_i - \mu_0\right)^2 \\ SS_1 &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2. \end{aligned} \]
Further, define
\[T_2^* = \frac{\sqrt{n} \left(\bar{y} - \mu_0\right)}{s}\]
where \(s\) is the sample standard deviation. Then, \(T_1^* = \left(T_2^*\right)^2\).
Proof. From the proof of Theorem B.1, we have that
\[SS_0 - SS_1 = n \left(\mu_0 - \bar{y}\right)^2.\]
We can then rewrite \(T_1^*\) as
\[T_1^* = \frac{n\left(\mu_0 - \bar{y}\right)^2}{s^2}.\]
Comparing this expression with the definition of \(T_2^*\), we see immediately that \(T_1^* = \left(T_2^*\right)^2\).
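As a sanity check of Theorem B.2, the following Python sketch (using an arbitrary simulated sample and an arbitrary null value \(\mu_0 = 4\)) computes both standardized statistics directly and verifies numerically that \(T_1^* = \left(T_2^*\right)^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=25)    # arbitrary simulated sample
mu0 = 4.0                                      # arbitrary null value
n = y.size
s = y.std(ddof=1)                              # sample standard deviation

ss0 = np.sum((y - mu0) ** 2)
ss1 = np.sum((y - y.mean()) ** 2)
t1 = (ss0 - ss1) / s**2                        # T_1^* = (SS_0 - SS_1) / s^2
t2 = np.sqrt(n) * (y.mean() - mu0) / s         # T_2^* = sqrt(n) (ybar - mu0) / s

print(np.isclose(t1, t2**2))                   # True: T_1^* equals (T_2^*)^2
```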
B.3 Least Squares Estimate for Intercept-Only Model
In Chapter 19, we state that the least squares estimate for the parameter \(\mu\) in a model of the data generating process that has the form
\[(\text{Response})_i = \mu + \varepsilon_i\]
is the sample mean response. While this is the intuitive estimate for \(\mu\) that we presented in Chapter 10, we had not yet related it to the method of least squares introduced in Chapter 17.
Theorem B.3 Consider a sample of size \(n\); suppose we posit the following model for the data generating process:
\[y_i = \mu + \varepsilon_i\]
where \(y_i\) represents the \(i\)-th observed value of the response variable and \(\mu\) is the sole parameter. The least squares estimate of \(\mu\) is \(\bar{y}\), the sample mean.
Proof. Observe that, by definition (see Definition 17.2), the least squares estimate for this model is the value of \(\mu\) which minimizes the quantity
\[\sum_{i=1}^{n} \left(y_i - \mu\right)^2.\]
Observe that if we add and subtract the sample mean \(\bar{y}\) inside the squared quantity, we can rewrite the above quantity as
\[ \begin{aligned} \sum_{i=1}^{n} \left(y_i - \mu\right)^2 &= \sum_{i=1}^{n} \left(y_i - \bar{y} + \bar{y} - \mu\right)^2 \\ &= \sum_{i=1}^{n} \left[\left(y_i - \bar{y}\right)^2 + 2\left(y_i - \bar{y}\right)\left(\bar{y} - \mu\right) + \left(\bar{y} - \mu\right)^2\right] \\ &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 + 2\left(\bar{y} - \mu\right) \sum_{i=1}^{n}\left(y_i - \bar{y}\right) + \sum_{i=1}^{n}\left(\bar{y} - \mu\right)^2 \\ &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 + 2\left(\bar{y} - \mu\right) \left(n\bar{y} - n\bar{y}\right) + n\left(\bar{y} - \mu\right)^2, \end{aligned} \]
where we make use of the fact that \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\) implies \(\sum_{i=1}^{n} y_i = n\bar{y}\). The second term in the above expansion is 0 because \(n\bar{y} - n\bar{y} = 0\). Therefore, we have that
\[\sum_{i=1}^{n} \left(y_i - \mu\right)^2 = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 + n\left(\bar{y} - \mu\right)^2.\]
Notice that the first term does not depend on \(\mu\), while the second term is \(n\) times a squared difference and is therefore non-negative. The second term can only make the sum larger, and it attains its smallest possible value of 0 when \(\mu = \bar{y}\). Therefore, the sum is minimized by taking \(\mu = \bar{y}\), which establishes that \(\bar{y}\) is the least squares estimate.
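As a numerical companion to Theorem B.3, the sketch below (using an arbitrary sample and SciPy's general-purpose scalar minimizer, neither of which appears in the text) minimizes the sum of squared deviations over \(\mu\) and confirms that the minimizer agrees with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([3.2, 4.8, 5.1, 2.9, 4.0])    # arbitrary sample; its mean is 4.0

def sse(mu):
    """Sum of squared deviations of the sample from a candidate value mu."""
    return np.sum((y - mu) ** 2)

result = minimize_scalar(sse)              # numerically minimize over mu
print(result.x, y.mean())                  # the minimizer agrees with the sample mean
```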
This theorem allows us to conclude that the error sum of squares for a model of the form
\[(\text{Response})_i = \mu + \varepsilon_i\]
is given by
\[\sum_{i=1}^{n} \left[(\text{Response})_i - (\text{Overall Mean Response})\right]^2,\]
where the overall mean response is the sample mean response observed.
B.4 Class of Hypotheses that are Testable
Chapter 19 developed a general standardized statistic for testing hypotheses in the linear model described by Equation 17.3, and Chapter 21 extended it to all linear models. This standardized statistic gives us a unifying testing framework across the text. However, not all hypotheses can be tested with it; the framework applies only to linear hypotheses. For example, consider the model (Equation 17.3):
\[(\text{Response})_i = \beta_0 + \beta_1 (\text{Predictor})_i + \varepsilon_i.\]
For this model, we can use the standardized statistic in Definition 19.7 to test any null hypothesis of the form
\[H_0: c_0\beta_0 = c_2, \ c_1\beta_1 = c_3\]
or
\[H_0: c_0\beta_0 + c_1\beta_1 = c_2,\]
where \(c_0, c_1, c_2, c_3\) are known constants. That is, we can test a hypothesis of the form
\[H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0\]
because this can be written by setting \(c_1 = 1\) and \(c_0 = c_2 = c_3 = 0\) in the first form of the hypothesis given above. We can also test a hypothesis of the form
\[H_0: \beta_1 = 0, \beta_0 = 3 \qquad \text{vs.} \qquad H_1: \text{At least one } \beta_j \text{ differs}\]
which corresponds to a reduced model of
\[(\text{Response})_i = 3 + \varepsilon_i\]
because this can be written by setting \(c_0 = c_1 = 1\), \(c_2 = 3\), and \(c_3 = 0\) in the first form of the hypothesis given above.
However, we cannot test a hypothesis of the form \(\beta_0\beta_1 = 1\); while this may be a valid hypothesis, it cannot be written in either of the forms described above. We would need additional theory to test such a hypothesis.
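To make the connection between a linear null hypothesis and its reduced model concrete, the following Python sketch (with data simulated under the null; the predictor values, sample size, and error scale are arbitrary) fits the full model by least squares, imposes the hypothesis \(H_0: \beta_0 = 3, \beta_1 = 0\) to obtain the reduced model, and computes the resulting shift in the error sum of squares. The exact scaling of the standardized statistic follows Definition 19.7 in the text and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)                       # arbitrary predictor values
y = 3.0 + 0.0 * x + rng.normal(scale=1.5, size=40)    # data simulated under the null

# Full model: (Response)_i = beta_0 + beta_1 (Predictor)_i + error
X_full = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X_full, y, rcond=None)
ss_full = np.sum((y - X_full @ beta_hat) ** 2)        # error sum of squares, full model

# Reduced model under H0 (beta_0 = 3, beta_1 = 0): (Response)_i = 3 + error
ss_reduced = np.sum((y - 3.0) ** 2)                   # error sum of squares, reduced model

print(ss_reduced - ss_full)                           # increase in error induced by H0
```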
B.5 Equivalence of Hypotheses with Blocking
Our model for the data generating process comparing a quantitative response across the \(k\) levels of a factor in the presence of \(b\) blocks is
\[(\text{Response})_i = \sum_{j=1}^{k} \mu_j (\text{Group } j)_i + \sum_{m=2}^{b} \beta_m (\text{Block } m)_i + \varepsilon_i\]
where \((\text{Group } j)\) and \((\text{Block } m)\) are appropriately defined indicator variables. Without loss of generality, consider the case where \(k = 2\) and \(b = 2\). Then, our model reduces to
\[(\text{Response})_i = \mu_1 (\text{Group 1})_i + \mu_2 (\text{Group 2})_i + \beta_2 (\text{Block 2})_i + \varepsilon_i.\]
We argue heuristically in Chapter 36 that testing the hypotheses
\[H_0: \mu_1 = \mu_2 \qquad \text{vs.} \qquad H_1: \mu_1 \neq \mu_2\]
which are hypotheses about the average response within Block 1, is equivalent to testing hypotheses about the overall average response across all blocks. To establish this more formally, note that, based on the above model for the data generating process, the overall average response for Group 1, call it \(\theta_1\), is given by
\[\theta_1 = \frac{1}{2} \mu_1 + \frac{1}{2} \left(\mu_1 + \beta_2\right) = \mu_1 + \frac{\beta_2}{2},\]
where we average the average response from Block 1 with that of Block 2. Similarly, the average response for Group 2, call it \(\theta_2\), is given by
\[\theta_2 = \frac{1}{2} \mu_2 + \frac{1}{2} \left(\mu_2 + \beta_2\right) = \mu_2 + \frac{\beta_2}{2}.\]
Now, note that \(\theta_1 - \theta_2 = \mu_1 - \mu_2\); therefore, \(\theta_1 = \theta_2\) if and only if \(\mu_1 = \mu_2\). That is, testing whether the overall average response across all individuals is equal across groups is equivalent to testing whether the average response within Block 1 is equal across groups.
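The algebra above can also be checked symbolically. The short sketch below (assuming the SymPy library is available; it is not used elsewhere in the text) encodes the across-block averages with equal weight on each block, as in the derivation, and confirms that the block effect cancels from the difference.

```python
import sympy as sp

mu1, mu2, beta2 = sp.symbols("mu1 mu2 beta2")

# Overall average responses, averaging Block 1 and Block 2 with equal weight
theta1 = sp.Rational(1, 2) * mu1 + sp.Rational(1, 2) * (mu1 + beta2)
theta2 = sp.Rational(1, 2) * mu2 + sp.Rational(1, 2) * (mu2 + beta2)

print(sp.simplify(theta1 - theta2))   # mu1 - mu2: the block effect beta2 cancels
```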