ismayc
@old_man_chester
rudeboybert
@rudeboybert
class: center, middle
Available in the supplementary HTML document here.
tidyverse
Cobb (TAS 2015): Minimize prerequisites to research. In other words, focus on the entirety of Wickham/Grolemund’s pipeline…
Furthermore, use the data science tools that a data scientist would use. Example: the tidyverse.
What does this buy us?
nycflights13 and fivethirtyeight
Cobb (TAS 2015): Two possible “computational engines” for statistics, in particular relating to sampling:
–
We present students with a choice for our “engine”:
Either we use this… | Or we use this…
---|---
–
–
What does this buy us?
Why should we do this?
class: center, middle, inverse
Download `master.zip`, then open `moderndiver-lite.Rproj` in RStudio.

install.packages('knitr', repos = c('http://rforge.net', 'http://cran.rstudio.org'), type = 'source')
DataCamp offers an interactive, browser based tool for learning R/Python. Their two flagship R courses, both of which are free:
Outsource many essential but not-fun-to-teach topics like `ggplot2`, `dplyr`, `rmarkdown`, and the RStudio IDE.

Getting used to R, RStudio, and R Markdown
class: inverse, center, middle
Vector/variable - Type of vector (`int`, `num`, `chr`, `logical`, `date`)
–
Data frame - Vectors of (potentially) different types - Each vector has the same length (number of rows)
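This definition can be illustrated directly in base R (the vectors here are made up for illustration):

```r
# Three vectors of (potentially) different types, all the same length
name   <- c("Ann", "Ben", "Cat")     # chr
height <- c(62.5, 70.1, 66.0)        # num
vegan  <- c(TRUE, FALSE, FALSE)      # logical

# A data frame binds them column-wise: one row per entry
people <- data.frame(name, height, vegan, stringsAsFactors = FALSE)
str(people)  # each column keeps its own type
```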
class: center, middle
The tidyverse is a collection of R packages that share common philosophies and are designed to work together.
The third point means we don’t mix apples and oranges.
Neat data is
- easy to look at,
- organized nicely, and
- in table form.
–
Tidy data is neat but also abides by a set of three rules.
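A minimal sketch of moving from neat to tidy, assuming `tidyr`'s `gather()` (part of the tidyverse) and a made-up wide table:

```r
library(tidyr)

# A 'neat' but non-tidy table: the year variable is spread across columns
dem_wide <- data.frame(country = c("Australia", "Pakistan"),
                       `1992` = c(10, 8),
                       `1997` = c(10, 7),
                       check.names = FALSE)

# gather() moves it to tidy form: one variable per column,
# one observation (a country-year pair) per row
dem_tidy <- gather(dem_wide, key = "year", value = "dem_score", -country)
dem_tidy
```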
class: center, middle
name: demscore
Frequently the first thing to do when given a dataset is to
This will help with
class: center, middle
Inspired by Hans Rosling
class: center, middle
Wilkinson (2005) laid out the proposed
“Grammar of Graphics”
class: center, middle
Wickham implemented the grammar in R
in the ggplot2
package
class: center, middle
–
`mapping` of data variables
–
to `aes()`thetic attributes
–
of `geom_`etric objects.

class: inverse, center, middle
- The `x`-axis is variable `A` and the `y`-axis is variable `B`
- The `color` of the points corresponds to `D`
- The `size` of the points corresponds to `C`
- The `color` of the line corresponds to `D`, with points added that are all green of size 4

ggplot2
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point()
–
ggplot2

The `color` of the points corresponds to `D`:
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point(mapping = aes(color = D))
–
ggplot2

The `size` of the points corresponds to `C`:
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) +
geom_point()
–
ggplot2
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line()
–
ggplot2

The `color` of the line corresponds to `D`, with points added that are all blue of size 4:

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line(mapping = aes(color = D)) +
geom_point(color = "blue", size = 4)
–
name: whytidy
Plot `year` on the `x`-axis and `dem_score` on the `y`-axis using `ggplot`:
dem_score4 <- dem_score_tidy %>%
filter(country %in% c("Australia", "Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
geom_line(mapping = aes(color = country))
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()
class: center, middle
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
geom_histogram(bins = 20, color = "black", fill = "darkorange")
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
geom_boxplot()
library(nycflights13)
ggplot(data = weather, mapping = aes(x = factor(month), y = humid)) +
geom_boxplot()
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
geom_bar()
dplyr
library(fivethirtyeight)
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
"'90-'94", "'95-'99", "'00-'04", "'05-'09",
"'10-'13")
bechdel <- bechdel %>%
mutate(five_year = cut(year,
breaks = seq(1969, 2014, 5),
labels = year_bins)) %>%
mutate(clean_test = factor(clean_test,
levels = c("nowomen", "notalk", "men",
"dubious", "ok")))
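The `cut()` call above bins `year` into five-year intervals; a minimal standalone sketch of how `cut()` assigns labels (values here are made up):

```r
# Intervals are open on the left, closed on the right by default:
# 1970 and 1974 both fall in (1969, 1974]
years  <- c(1970, 1974, 1975, 2013)
binned <- cut(years,
              breaks = c(1969, 1974, 1979, 2014),
              labels = c("'70-'74", "'75-'79", "'80-'13"))
as.character(binned)  # "'70-'74" "'70-'74" "'75-'79" "'80-'13"
```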
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar()
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "dodge")
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "fill", color = "black")
class: center, middle
Produce the appropriate 5NG with the R package & data set in [ ], e.g., [`nycflights13` \(\rightarrow\) `weather`]

- Does `age` predict `recline_rude`? [`fivethirtyeight` \(\rightarrow\) `na.omit(flying)`]
- Distribution of `age` by `sex` [`okcupiddata` \(\rightarrow\) `profiles`]
- Does `budget` predict `rating`? [`ggplot2movies` \(\rightarrow\) `movies`]
- Distribution of `budget_2013` on a log base 10 scale [`fivethirtyeight` \(\rightarrow\) `bechdel`]
class: inverse, center, middle
The `gapminder` data frame in the `gapminder` package:

library(gapminder)
gapminder
tidyverse
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
–
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
summarize(mean_exp = mean(lifeExp))
The `%>%` (pipe) operator in `dplyr` is equivalent to the `+` in `ggplot2`.
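Concretely, `x %>% f(y)` is `f(x, y)`: the pipe feeds the left-hand side in as the first argument, so nested calls become a top-to-bottom chain. A sketch using the built-in `mtcars` data:

```r
library(dplyr)

# Nested form: read inside-out
nested <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# Piped form: read top-to-bottom, one verb per line
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))

all.equal(nested, piped)  # same one-row result either way
```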
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()
filter()
summarize()
group_by()
mutate()
arrange()
filter()
Select a subset of the rows of a data frame.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
Use `==` to compare a variable to a value.

Use `|` to check for any of multiple filters being true:

gapminder %>%
  filter(year == 2002 | continent == "Europe")
–
Use `&` or `,` to check for all of multiple filters being true:

gapminder %>%
  filter(year == 2002, continent == "Europe")
Use `%in%` to check for any of several values being true (instead of using `|` repeatedly with `==`):

gapminder %>%
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
–
summarize()

max_exp_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997
–
`summarize()` with `group_by()`

When you’d like to determine a numerical summary for all levels of a categorical variable:
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp),
sd_exp = sd(lifeExp))
max_exp_1997_by_cont
ggplot2 revisited

For aggregated data, use `geom_col()`:
ggplot(data = max_exp_1997_by_cont,
mapping = aes(x = continent, y = max_exp)) +
geom_col() +
geom_errorbar(mapping = aes(ymin = max_exp - sd_exp,
ymax = max_exp + sd_exp))
filter()
summarize()
group_by()
–
mutate()
–
arrange()
mutate()
–
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)
mutate()
–
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)
mutate()
–
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)
arrange()
gapminder %>%
arrange(year, country)
arrange()
gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp)) %>%
head(10)
arrange() and group_by()

- `group_by()` is used (mostly) with `summarize()` to calculate summaries over groups
- `arrange()` is used for sorting

arrange() and group_by()
This doesn’t really do anything useful
gapminder %>% group_by(year)
arrange
and group_by
But this does
gapminder %>% arrange(year)
True or False

Each of `filter()`, `mutate()`, and `arrange()` changes the observational unit.
–
True or False
group_by() %>% summarize()
changes the observational unit.
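Both claims can be checked by counting rows, since changing the observational unit changes what one row represents. A sketch with the built-in `mtcars` data:

```r
library(dplyr)

nrow(mtcars)  # 32 rows: the observational unit is a single car

# filter/mutate/arrange keep one row per car (filter may drop some,
# but each remaining row is still a car)
mtcars %>% mutate(kpl = 0.425 * mpg) %>% nrow()  # still 32

# group_by() %>% summarize() collapses to one row per group:
# the observational unit becomes a cylinder class
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
nrow(by_cyl)  # 3 rows: cyl = 4, 6, 8
```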
class: center
–
library(fivethirtyeight)
library(readr)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police_locals, y = ideology, by = "city")
rmarkdown::paged_table(police_join)
cost_of_living <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/cost_of_living.csv")
police_join_cost <- inner_join(x = police_join, y = cost_of_living, by = "state")
rmarkdown::paged_table(police_join_cost)
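The `inner_join()` calls above match rows on the `by` column and keep only keys present in both tables; a toy sketch with made-up data frames:

```r
library(dplyr)

cities <- data.frame(city  = c("Boise", "Reno", "Provo"),
                     pop_k = c(236, 264, 115))
states <- data.frame(city  = c("Boise", "Reno"),
                     state = c("ID", "NV"))

# "Provo" has no match in 'states', so it is dropped from the result
joined <- inner_join(x = cities, y = states, by = "city")
joined
```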
ggplot(data = police_join_cost,
mapping = aes(x = index, y = all)) +
geom_point(aes(color = state_ideology)) +
labs(x = "Cost of Living Index", y = "% Officers Living in City")
Use the 5MV to answer problems from R data packages, e.g., [`nycflights13` \(\rightarrow\) `weather`]

- What is the maximum arrival delay for each carrier departing JFK? [`nycflights13` \(\rightarrow\) `flights`]
- Calculate the domestic return on investment for 2013-scaled data, descending by ROI [`fivethirtyeight` \(\rightarrow\) `bechdel`]
- Include the name of the carrier as a column in the `flights` data frame [`nycflights13` \(\rightarrow\) `flights`, `airlines`]
class: inverse, center, middle
class: inverse, center, middle
We now enter the “traditional” topics of intro stats…
… but now students are armed with
Sampling is at the root of statistics. Two approaches:
Either we use this… | Or we use this…
---|---
The mosaic Package

It has functions for random simulations:

- `rflip()`: Flip a coin
- `shuffle()`: Shuffle a set of values
- `do()`: Do the same thing many, many, many times
- `resample()`: the Swiss Army knife for sampling

The example will be built around this code (available in the supplementary HTML document here):
library(ggplot2)
library(dplyr)
library(mosaic)
single_cup_guess <- c(1, 0)
simulation <- do(10000) * resample(single_cup_guess, size=8, replace=TRUE)
View(simulation)
simulation <- simulation %>%
mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8)
View(simulation)
ggplot(simulation, aes(x=n_correct)) +
geom_bar() +
labs(x="Number of Guesses Correct", title="Number Correct Assuming She Is Guessing at Random") +
geom_vline(xintercept=8, col="red")
There is only one test; it has 5 components:
Posed to students: Did students with an even number of letters in their last name do better than those with an odd number? (Available in the supplementary HTML document here.)
library(tidyverse)
library(mosaic)
grades <- read_csv("https://raw.githubusercontent.com/ismayc/moderndive-workshops/master/docs/slides/data/grades.csv")
View(grades)
# Observed Difference: Using mosaic package's mean function
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
null_distribution <- (do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)) %>%
mutate(difference=odd-even)
View(null_distribution)
# Plot
ggplot(data=null_distribution , aes(x=difference)) +
geom_histogram(binwidth = 0.025) +
labs(x="Avg of Odds - Avg of Evens") +
geom_vline(xintercept = observed_diff, col="red")
The example will be built around this code: (Available in the supplementary HTML document here)
Try changing `n` and repeating. What does this correspond to doing in real life? Note the role `n` plays when it comes to estimating \(\mu\).

library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
# For simplicity, remove 3 individuals who did not list their height
profiles <- profiles %>%
filter(!is.na(height))
# Population mean
mu <- mean(profiles$height)
# Sample size:
n <- 5
# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))
# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)
# Part 4:
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="sample mean") +
xlim(c(50,80)) +
geom_vline(xintercept=mu, col="red")
# Part 5:
sd(samples$mean)
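Part 5's `sd(samples$mean)` is the simulation-based standard error; it should land near the theoretical \(\sigma/\sqrt{n}\). A base-R-only check (no `mosaic`), using a hypothetical normal population:

```r
set.seed(76)

# Hypothetical population of 10,000 heights (inches)
population <- rnorm(10000, mean = 68, sd = 4)
n <- 5

# Sampling distribution of the mean: 10,000 resamples of size n
sample_means <- replicate(10000,
                          mean(sample(population, size = n, replace = TRUE)))

# Simulated standard error vs. theoretical sigma / sqrt(n)
sd(sample_means)
sd(population) / sqrt(n)
```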
Convincing students:

- The `ggplot2` package and knowledge of the Grammar of Graphics primes students for regression
- The `broom` package helps unpack regression

ggplot2 Primes Regression

Example:
Involves four variables: `carrier`, `temp`, `dep_delay`, and `summer`

ggplot2 Primes Regression

Why? Dig deeper into the data. Look at the `origin` and `dest` variables as well:
The broom Package

The `broom` package takes the messy output of built-in modeling functions in R, such as `lm`, `nls`, or `t.test`, and turns them into tidy data frames that fit into the `tidyverse` ecosystem.

In our case, `broom` functions take `lm` objects as inputs and return the following in tidy format:

- `tidy()`: regression output table
- `augment()`: point-by-point values (fitted values, residuals, predicted values)
- `glance()`: scalar summaries like \(R^2\)

The chapter will be built around this code (available in the supplementary HTML document here).
library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
sample_n(50)
View(alaska_flights)
# Exploratory Data Analysis----------------------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
# Correlation coefficient:
alaska_flights %>%
summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
# Fit Regression and Study Output with broom Package---------------------------
# Fit regression
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
# 1. broom::tidy() regression table with confidence intervals and no p-value stars
regression_table <- delay_fit %>%
tidy(conf.int=TRUE)
regression_table %>%
kable(digits=3)
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>%
augment() %>%
select(arr_delay, dep_delay, .fitted, .resid)
regression_points %>%
head() %>%
kable(digits=3)
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>%
augment(newdata = new_flights) %>%
kable()
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>%
glance()
regression_summaries %>%
kable(digits=3)
# Residual Analysis------------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
geom_histogram(binwidth=10) +
geom_vline(xintercept = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
stat_qq()
# Preview of Multiple Regression-----------------------------------------------
flights_subset <- flights %>%
filter(carrier == "AS" | carrier == "F9") %>%
left_join(weather, by = c("year", "month", "day", "hour", "origin")) %>%
filter(dep_delay < 250) %>%
mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
ggplot(data = flights_subset, mapping = aes(x = temp, y = dep_delay, col = carrier)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~summer)
class: inverse, center, middle
By July 1st, 2017:

- `knitr::kable()` output or the `View()` function
- the `tutorial` package

class: inverse, center, middle
bookdown
txt
file.

.left-column[
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And Tables
---- -------
Like This
]
.right-column[ ]
.left-column[
```{r chunk1}
library(dplyr)
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>%
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```
```{r chunk2}
ggplot(data = pdx_flights,
mapping = aes(x = arr_delay,
y = dep_delay)) +
geom_point()
```
]
.right-column[
[1] 88
]
Why bookdown?

From the bookdown book about `bookdown`:
Author books with R Markdown, including generating figures and tables, and inserting cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org).
class: inverse, center, middle
bookdown
class: middle