Scientists and engineers are asked to make decisions on a regular basis, and often those decisions need to be supported by data. This course introduces statistical concepts in the context of engineering and the sciences. The course emphasizes statistical literacy (speaking the language of statistics and interpreting statistical methods and results), reasoning (defining the need for data to address questions, understanding the role and impact of variability, and understanding foundational concepts), design (approaches to data collection and the impact of the study design on the conclusions), and analysis (choosing an appropriate methodology and using a statistical computing environment to carry out an analysis).

The course covers common methods encountered in engineering and the sciences, including linear regression, ANOVA, and repeated measures ANOVA. The class is taught from a model-based perspective; therefore, we describe approaches to collecting data, summarizing the information contained within the data, building a model to address a question of interest, using data to estimate the unknowns in the model, assessing the model, and interpreting the results based on the model.

Decisions need to be made in both industry and science, and the course discusses the collection and use of data for decision making in the presence of variability. The course is built on five fundamental ideas:

  1. A research question can often be framed in terms of a parameter that characterizes the population. Framing the question should then guide our analysis.
  2. If data is to be useful for making conclusions about the population, a process referred to as drawing inference, proper data collection is crucial. Randomization can play an important role in ensuring a sample is representative and that inferential conclusions are appropriate.
  3. The use of data for decision making requires the data be summarized and presented in ways that address the question of interest.
  4. Variability is inherent in any process, and as a result, our estimates are subject to sampling variability. However, these estimates often vary across samples in a predictable way; that is, they have a distribution that can be modeled (see the sketch following this list).
  5. With a model for the distribution of a statistic under a proposed model for the data generating process, we can quantify the likelihood of an observed sample under that proposed model. This allows us to draw conclusions about the corresponding parameter, and therefore the population, of interest.
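
To make the fourth idea concrete, here is a minimal R sketch (not part of the course materials) that approximates the sampling distribution of a sample mean by simulation; the sample size, number of replications, and the exponential population are arbitrary choices made only for illustration.

```r
# Minimal illustration (not from the course materials): approximating the
# sampling distribution of the sample mean by repeated sampling.
set.seed(123)               # for reproducibility

n    <- 30                  # sample size (arbitrary choice)
reps <- 5000                # number of simulated samples (arbitrary choice)

# Draw many samples from a skewed population (Exponential(1)) and record each mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The simulated means cluster around the population mean (1) in a predictable way.
hist(sample_means,
     main = "Approximate sampling distribution of the sample mean",
     xlab = "Sample mean")
quantile(sample_means, c(0.025, 0.975))   # middle 95% of the simulated means
```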

Learning Objectives

At the end of this course, students should be able to perform the following tasks:

  1. Given a problem description, identify the population and parameter(s) of interest as well as the statistic(s) from the sample appropriate for estimating the parameter(s). When relevant, formulate a set of statistical hypotheses that addresses the research goal.
  2. Describe the importance of considering the data collection scheme when interpreting the results of a study, including potential confounding, replicability, and generalizability. Given a problem description, identify potential reasons for variability in the observed response and the limitations of the data collection scheme.
  3. Construct and interpret graphical and numerical summaries of data to address a given question of interest.
  4. Describe general techniques for modeling the sampling distribution of a statistic, and discuss the role of a sampling distribution in inference.
  5. Given a question of interest, conduct an appropriate statistical analysis (using either confidence intervals or p-values) in order to aid in decision making, and given a statistical analysis, interpret the results of the analysis in the context of the problem.
  6. Comment on the adequacy of a statistical method for addressing a given question of interest by assessing the conditions underlying the method.
  7. Develop a question of interest; then, design and implement a study to address the question.
  8. Identify the value of statistical methodology in the advancement of science as well as recognize its limitations.
  9. Collaborate with others to conduct data collection and a statistical analysis and communicate the results appropriately.
  10. Support a decision using graphical and/or numerical data.

As students progress through the course, objectives specific to each module will be given; accomplishing these module-level objectives will help students succeed in accomplishing the course-level objectives.

Course Structure

I run the class as a flipped course. That is, the primary content is delivered remotely through course readings and videos which are accompanied by a note packet. Class meetings are reserved for activities and discussions surrounding key concepts, group work on homework, and answering questions on assignments.

I also offer an online version of the course which is delivered asynchronously. Self-paced activities replace the classroom discussions, and I am available remotely via Microsoft Teams to answer questions on assignments.

The course material is broken into nine modules (one covered each week during the term, leaving a week for getting into the course and reviewing for the final exam). The first three modules form a unit on inference for a single population, the second three modules form a unit on the simple linear regression model (including comparing two groups), and the last three modules form a unit on comparing multiple group means (including repeated measures).

  1. Statistical Process: The discipline of statistics is about turning data into usable information. Moreover, we would like that information to apply beyond just the observed subjects. In this section, we introduce some of the language necessary for describing this process, known as statistical inference. We also discuss how to construct well-posed questions that can be addressed using data and how to frame those questions in statistical language. Finally, we discuss data presentation.
  2. Sampling Distributions: Doing the same study twice will result in different data, and different data will result in different estimates and inferences. How, then, can we have confidence in our estimates? This unit develops a notion of confidence which acknowledges the variability in an estimate that results from the variability in the sampling process. We then use this notion to conduct inference on the unknown parameter of interest.
  3. Null Distribution: Determining whether the data is consistent with the null hypothesis requires knowing what to expect if the null hypothesis were true. This unit develops a notion of how we can quantify those expectations. We then use this notion to conduct inference on the unknown parameter of interest.
  4. Simple Linear Regression: Many questions of interest concern the relationship between two variables. In this unit, we consider a model for the data generating process which relates a quantitative response to a quantitative predictor linearly.
  5. Partitioning Variability: In order to compare two models for how the response is generated, we must ask the question: why isn't the response the same for each subject? This process allows us to determine how much the predictor contributes to the variability in the response (a brief sketch following this list previews this idea).
  6. Assessing Conditions in Regression: Once we have a model for the data generating process, we can develop a model for the sampling distribution of our parameter estimates or the null distribution of the standardized statistic. However, their construction relies on the conditions we place on the stochastic portion of the model. We should always assess whether the data is consistent with these conditions before employing the model.
  7. ANOVA: When we want to compare the mean of a quantitative response across several groups, we must extend our model for the data generating process. While at first glance it appears very disconnected from what we have been studying, the key idea is simply to partition the variability.
  8. Assessing Conditions in ANOVA: The p-value we obtain in an ANOVA table is only reliable if the data is consistent with the conditions on which the corresponding model for the null distribution was constructed. We describe ways of assessing these conditions graphically.
  9. Block Designs: It is beneficial to account for additional sources of variability in the response; it is crucial when those sources result in the responses being correlated with one another. In this unit, we discuss analytical methods for generalizing our ANOVA model to account for correlated responses, and how to capitalize on these sources in the study design.
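
As a preview of how partitioning variability connects regression and ANOVA, the short sketch below fits both models to R's built-in mtcars data using the standard lm() and anova() functions; this is a generic base-R illustration rather than the workflow of the course's IntroAnalysis package.

```r
# Brief base-R sketch (not the course's IntroAnalysis workflow): the same
# partitioning of variability underlies regression and ANOVA.

# Simple linear regression: relate fuel efficiency to weight (mtcars ships with R).
slr_fit <- lm(mpg ~ wt, data = mtcars)
summary(slr_fit)   # estimated intercept and slope with standard errors
anova(slr_fit)     # variability explained by the predictor vs. residual variability

# Comparing several group means: treat the number of cylinders as a grouping factor.
anova_fit <- lm(mpg ~ factor(cyl), data = mtcars)
anova(anova_fit)   # the familiar ANOVA table; again, a partition of variability
```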

Statistics is a unique discipline in that it exists solely to aid decision-makers in other fields. This course seeks to improve your statistical literacy and reasoning so that you could successfully collaborate on a small research study. It can also serve as a launching point for further statistics coursework that would eventually allow you to contribute to the analysis of a study.

To assess your progress toward the course goals, several types of assignments are given throughout the course. Each type of assignment assesses a different aspect of the course; some things are best learned in groups, while others should be mastered at the individual level. As a result, the grading scheme uses a mix of components as well. To help you achieve the objectives of the course, I will be implementing a variation on "specifications grading." That is, instead of taking a weighted average of points earned on a series of assignments throughout the term, grades are earned by establishing competency (across the course) in four areas: Statistical Literacy, Statistical Reasoning, Statistical Design, and Statistical Analysis. While no partial credit is awarded, very clear expectations are provided to help students demonstrate competency.

Course Materials

Having taught this course regularly for years, I gradually drifted into teaching it from a model-based perspective. However, I have not found many texts that support this perspective with the coverage needed for this course (Kaplan's Statistical Modeling: A Fresh Approach is close). So, I compiled my course notes into an online text (Statistical Foundations for Engineers and Scientists) that I use for the course. Course note packets are given to students each week to help direct their reading.

I have also created some "concept clips" (videos explaining key concepts from the course) and "example videos" (walking through examples with the course software); these go along with the note packets as well. If interested, you can access these videos on Panopto.

I teach the class in RStudio; while RStudio is a free program, our IT department has set up a server so students can access it through a web browser. I have created an R package (IntroAnalysis) to accompany the course. In addition to custom functions for conducting analysis from a model-based perspective, the package contains tutorials that can be accessed using the learnr package.
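
For students setting up on their own machines, the sketch below shows one way to browse and launch the package tutorials with learnr. The tutorial name shown is only a placeholder; the actual names depend on the installed version of IntroAnalysis, so list them first.

```r
# Sketch of accessing the course tutorials through learnr. The tutorial name
# below is a placeholder; list the actual names with available_tutorials() first.
library(IntroAnalysis)
library(learnr)

# See which tutorials ship with the package.
available_tutorials(package = "IntroAnalysis")

# Launch one by name (replace "tutorial-name" with an entry from the list above).
run_tutorial("tutorial-name", package = "IntroAnalysis")
```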