22  Putting it All Together

For the Seismic Activity Case Study, consider the following question:

Overall, does the average bracketed duration at a location depend on the distance a location is from the center of an earthquake?

While we have touched on this question in previous chapters, we investigate this question more fully in this chapter.

22.1 Graphical Summary

Before developing a statistical model to address our question, we summarize the data graphically. The question suggests the bracketed duration is the response variable and the epicentral distance is the predictor. Figure 22.1 illustrates the relationship between the bracketed duration and the epicentral distance. We note that the axis for the epicentral distance takes logarithmic steps to better illustrate the relationship. That is, moving from 1 to 10 kilometers has roughly the same effect as moving from 10 to 100 kilometers from the earthquake.

Scatterplot of the bracketed duration and the distance from the epicenter, the latter displayed on a logarithmic scale.
Figure 22.1: Relationship between the bracketed duration and the distance from the epicenter of an earthquake for locations measuring seismic activity in Greece.

22.2 Development of Statistical Model

In order to address our primary question of interest, we must develop a statistical model which explains the data generating process and embeds our question of interest in terms of the parameters of the model. Based on the question of interest, and our observations in the graphical exploration above, our model explaining the generation of the bracketed duration should depend on the epicentral distance. Further, this relationship should be on a logarithmic scale to account for the “stretched” scale in Figure 22.1.

At first, it may seem that the logarithmic scale prevents our use of the “linear” model we discussed in this unit, but the model is flexible enough to accommodate this development. We propose the following model for the data generating process:

\[(\text{Bracketed Duration})_i = \beta_0 + \beta_1 \log_{10}(\text{Epicentral Distance})_i + \varepsilon_i. \tag{22.1}\]

Notice that instead of just the epicentral distance, our predictor is the base-10 logarithm of the epicentral distance. This transformed variable enters the model just as any other predictor. In addition to modeling the deterministic portion of the data generating process, we must also place conditions on the stochastic portion in order to make inference. We consider the conditions of the classical regression model:

  • The error in the bracketed duration is 0, on average for all epicentral distances; that is, the model for the deterministic portion is correctly specified.
  • The error in the bracketed duration for one location is independent of the error in the bracketed duration for any other location.
  • The variability of the error in the bracketed duration is the same for all locations regardless of the epicentral distance.
  • The error in the bracketed duration follows a Normal distribution.

Before imposing these conditions, however, we should assess them.

22.3 Assessment of Conditions

Before making inference regarding our question of interest, we should determine if our data is consistent with the conditions on the error term we have specified. Figure 22.2 is a probability plot of the residuals used to assess whether the data is consistent with the Normality condition.

Scatterplot of the residuals against the theoretical quantiles from a Normal distribution.
Figure 22.2: Normal probability plot of the residuals corresponding to a model for the bracketed duration of seismic events in Greece as a function of the epicentral distance.

The plot reveals some departure from the linear relationship we would expect if the errors followed a Normal distribution. It seems unreasonable to impose the Normality condition.

Figure 22.3 is a plot of the residuals for the observations in the order in which they were collected. Since the data was collected over time, this plot could reveal potential patterns among the residuals which suggest a departure from independence among the errors. As there are no trends in either the location or spread of the residuals, the data is consistent with the independence conditions.

Line plot of residuals over time exhibiting no pattern in location or spread.
Figure 22.3: Time-series plot of the residuals corresponding to a model for the bracketed duration of seismic events in Greece as a function of the epicentral distance.

Figure 22.4 is a plot of the residuals against the predicted values from the deterministic portion of the model. There are no obvious trends in the location of the residuals; that is, the residuals balance out around 0 for all predicted values. Therefore, we are willing to the data is consistent with the condition that the mean of the errors is 0 for each combination of the predictors. We note that while the points balance out around 0 (as indicated by a smoother that remains near 0), they tend to have a larger range on the positive side compared to the negative side; this relates back the observation we made earlier that the residuals do not behave like a sample from a Normal distribution.

We also note that the spread of the residuals is larger on the right side of the graphic compared to the left side. As the predicted values increase, the spread of the residuals also increases. This suggests that for larger bracketed durations, the model is not as precise. Therefore, we are not willing to impose the constant variance condition.

Scatterplot of residuals against predicted values for which the spread increases as you move left to right across the graphic.
Figure 22.4: Plot of the residuals against the predicted values corresponding to a model for the bracketed duration of seismic events in Greece as a function of the epicentral distance.

Examining the residuals, we determined that the data is consistent with the following two conditions:

  • The error in the bracketed duration for one location is independent of the error in the bracketed duration for any other location.
  • The error in the bracketed duration is 0, on average, for all epicentral distances; that is, the deterministic portion of the model for the data generating process is correctly specified.

Since we are only willing to assume these two conditions, we will use an empirical model (bootstrap procedure) for the sampling distribution of the least squares estimates.

22.4 Summary of Model Fit

The parameters in our model are estimated via the method of least squares. The variability in these estimates is quantified using an empirical model of the sampling distribution based on 5000 bootstrap replications. Table 22.1 summarizes the estimates for the parameters in Equation 22.1.

Table 22.1: Summary of the model characterizing the bracketed duration of seismic events in Greece as a function of the epicentral distance.
Term Estimate Standard Error 95% Lower CI 95% Upper CI
(Intercept) 2.553 1.031 0.464 4.561
log10(Epicentral Distance) 2.226 0.985 0.347 4.248

The results suggests that for each 10-fold increase in the number of kilometers a location is from the epicenter of an earthquake (1-unit increase in the base-10 logarithm), the bracketed duration increases between 0.347 and 4.248 seconds, on average. This result is counter-intuitive; it suggests that the further a location is from an earthquake, the longer it is subjected to major ground motion. First, remember that this is data from an observational study; so, the result is not suggesting a causal relationship. Indeed, confounding is playing a role here. Figure 22.5 examines the relationship between the bracketed duration and the epicentral distance in addition to the magnitude of the earthquake and the soil conditions of the location. While it is somewhat difficult to see, note that earthquakes with the largest magnitudes also tended to be recorded further away. While we would need to confirm with a discipline expert, it suggests that the major fault line is approximately 100 kilometers away from the Greek observation stations. As a result, the recording locations in the study experienced the largest amount of major motion when these large quakes occurred. A more robust analysis using the general form of the linear regression model of the previous chapter would reveal that after accounting for these additional terms, the data is consistent with decreases in the bracketed durations, on average, for locations further from the center of an earthquake.

Three scatterplots of the bracketed duration against the epicentral distance with the color of the points corresponding to the magnitude; each scatterplot is for locations of a different soil type.
Figure 22.5: Relationship between the bracketed duration and the distance from the epicenter of an earthquake for locations measuring seismic activity in Greece. The relationship is presented for various soil types.

It is important to always place our analyses in the context of the problem and consult with discipline experts. This can sometimes reveal shortcomings in the proposed model for the data generating process. Remember, statistical analyses are not a crystal ball; they reveal potential patterns, but only those that we incorporate into our model for the data generating process.