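Computational data analysis is an essential part of modern statistics. This course introduces tools and strategies for data management, manipulation, and analysis that are common in statistics and data science. These skills are developed primarily in the R computing language. Topics include reproducible research, tidy data, static and dynamic graphics, programming, string manipulation, data input/output, simulation, and randomization-based inference.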

The class is very applied, emphasizing a "learn by trying" approach. There are short computational checks each week, but the majority of class time is spent working on projects corresponding to each of the topics in the course. By the end of the course, students should have developed a strong portfolio of work that showcases their skills. The class stresses communicating the approach taken to address a problem as well as the programming implementation.

Students should have a background in statistics (introductory statistics course) and prior experience in programming (any language).

Learning Objectives

Data is everywhere, and tools are needed to turn data into information. This course discusses the use of scripting in a statistical computing language to aid in this process. At the end of this course, students will be able to perform the following tasks:

  1. Associate a computational task with the appropriate function(s) or package(s) (suites of related functions) in a statistical computing language.
  2. Given a script, describe the computational task being performed.
  3. Given a computational task from the workflow of a statistical project, construct a script to complete the task.
  4. Given a research objective, integrate multiple computational tasks to provide a data-driven conclusion.
  5. Communicate the solution to a computational task, identifying and describing key steps/chunks within the solution.
  6. Express the value of scripting a solution to a computational task.
  7. Identify resources that generalize the material covered in class in order to learn new tools for solving a novel computational task.

Course Structure

This course is delivered as a flipped hybrid course. That is, the content is primarily delivered remotely via readings, videos, and tutorials. One class period is reserved for an activity to gain more exposure to the tools and methods discussed in that module. A second class period is reserved as a workshop where students can get immediate feedback on their module projects.

The course is divided into 10 modules (one per week):

  1. Reproducible Research: We introduce the concept of reproducible research: being able to trace the conclusions of a study back to the data and analysis that produced them. For example, if the data are updated, any computed summaries should update as well. R and R Markdown documents will be introduced as tools for facilitating reproducible research and collaboration among team members. We will also introduce basic functionality in R, such as reading in a CSV file containing a dataset and performing basic computations.
  2. Tidy Data: When we think of data, we often envision a spreadsheet. However, not all spreadsheets are created equal. We introduce principles (known as "tidy data") for storing data that make it easy to work with. We also introduce the "tidyverse" as a suite of packages and tools for working with tidy data and the corresponding key verbs.
  3. Static Graphics: A good graphical summary is often more valuable than an expertly constructed model. We will introduce the grammar of graphics as the foundation for creating a graphical summary. We then introduce the ggplot2 package as a tool that implements this grammar.
  4. Programming: Data is rarely in the format needed for analysis. The process of manipulating data (wrangling) necessarily involves programming. This involves writing custom functions and applying functions efficiently within various data structures. We will introduce the benefit of vectorizing operations and tools for implementing vectorized code.
  5. String Manipulation: Character data brings with it challenges that numeric data does not. We will introduce some of these challenges as well as common base operations with strings. In addition, we will touch on the use of regular expressions for manipulating character data.
  6. Data Input/Output: While we often think of datasets as tabular, data does not always begin in this format. We introduce methods for reading in unstructured data, focusing primarily on web scraping: obtaining data from HTML/XML sources through CSS selectors.
  7. Dynamic Graphics: Some graphical summaries are improved by dynamic or interactive elements. These include maps, time-lapse motion, and interactivity (hovering, etc.). We will discuss some common functionality and introduce plotly as a tool for implementation.
  8. Simulations: Numerical simulations can be used to investigate complex processes as well as evaluate the performance of various statistical methods. These are particularly useful when we have a firm model for the underlying components of the process and how these components fit together. We introduce the steps in conducting a simulation study and tools for generating random variates.
  9. Randomization-Based Inference: We have seen the use of numerical simulation for investigating a process or examining the properties of a statistical method. In this module, we discuss the applications of simulations to inference. We discuss bootstrapping and randomization-based hypothesis testing.
  10. Choose Your Own Adventure: The course only scratches the surface of the topics introduced, and it excludes a vast number of beautiful topics that are part of the statistical analysis pipeline. We provide space for each student to investigate a topic of their choice related to the statistical analysis pipeline. This could be a topic tangential to one discussed, an extension of methods discussed, or a brand new topic.
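
To give a flavor of the simulation-based work in modules 8 and 9, the following is a minimal sketch of a percentile bootstrap confidence interval for a mean. It is illustrative only and is not taken from the course materials: it is written in Python (one of the alternative languages the course permits) rather than R, and the function name bootstrap_ci, its parameters, and the sample values are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for a statistic.

    Hypothetical helper for illustration; not part of the course materials.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read the interval endpoints off the sorted bootstrap estimates.
    alpha = (1 - level) / 2
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, used only to exercise the function.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9]
lower, upper = bootstrap_ci(sample)
print(f"Approximate 95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Fixing the seed ties the sketch back to module 1: rerunning the script reproduces the same interval, and changing the data changes the summary with no manual steps.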

Course Materials

The course uses R for Data Science by Wickham and Grolemund for the first half of the course. This text acts as the foundation of the course. The second half of the course will depend on vignettes, original articles, and course notes to cover more advanced topics in statistical computing.

Lectures focus on broad concepts common to all statistical computing languages, and while the text and instructor primarily support R, students are free to choose other languages (Python, Julia, SAS, etc.). In practice, analysts commonly move between multiple tools to address a problem; however, for the course, it is advised that you stick with a single language in order to focus on the concepts.

Course tutorials are delivered via the learnr package.