Introduction

Who We Are


class: center, middle

Our Textbook


Albert’s Course (Intro to Statistical & Data Sciences)

Available in the supplementary HTML document here.

  • Webpage and GitHub Repo
  • Administrative:
    • Chief non-econ/bio stats service class at Middlebury
    • 12 weeks each with 3h “lecture” + 1h “lab”
    • No prerequisites
  • Students:
    • ~24 students/section of all years/backgrounds. Only stats class many will take
    • Background: Many had AP stats, some with programming
    • All had laptops that they brought every day
  • Topic List
    • First half is data science: data visualization, manipulation, importing
    • Second half is intro stats: sampling, hypothesis tests, CI, regression
  • Evaluation
    • 10%: weekly problem sets
    • 10%: engagement
    • 45%: 3 midterms (last during finals week)
    • 35%: Final projects
  • Typical Classtime:
    • First 10-15min: Priming topic, either via slides or chalk talk
    • Remainder: Students read over text & do Learning Checks in groups and without direct instructor guidance.

Chester’s Course (Social Statistics)

Available in the supplementary HTML document here.

  • Webpage at <http://bit.ly/soc-301> and GitHub Repo
  • Administrative:
    • Chief stats service class for sociology/criminal justice
    • An option to fulfill the Pacific U. math requirement
    • 14 weeks, meeting on Tues & Thurs for 95 minutes
    • No prerequisites
  • Students:
    • 26 students of all years/backgrounds. Only stats class many will take
    • Background: 3 had AP stats, zero with programming
    • All had laptops that they brought every day
  • Course Schedule
    • First half is data science: data visualization, wrangling, importing
    • Second half is intro stats: sampling, testing, CI
  • Evaluation
    • 5%: Engagement/Pass-fail Learning Checks
    • 10%: DataCamp/article summarizing assignments
    • 15%: Group Project
    • 20%: Pencil-and-paper Midterm Exam
    • 25%: Five multiple-choice cumulative quizzes
    • 25%: Cumulative Pencil-and-paper Final Exam
  • Typical Classtime:
    • First 5-10min: Students answer warmup exercise based on previous content
    • Next 10-20min: Review reading assignment via slides
    • Bulk of class:
      • Students read over text & do Learning Checks in groups and without direct instructor guidance.
      • Students work on next DataCamp problems and ask questions as needed
    • Last 5-10min: Go over warmup exercise again or quiz students on material from that period

What Are We Doing And Why?

  1. Data first! Start with data science via the tidyverse;
    statistics then builds on these ideas.
  2. Replacing the mathematical/analytic with computational/simulation-based whenever possible.
  3. The above necessitates algorithmic thinking, computational logic and some coding/programming.
  4. Complete reproducibility

1) Data First!

Cobb (TAS 2015): Minimizing prerequisites to research. In other words, focus on the entirety of Wickham and Grolemund’s pipeline…


1) Data First!

Furthermore, use data science tools that a data scientist would actually use. Example: the tidyverse



1) Data First!

What does this buy us?

  • Students can do effective data storytelling
  • Context for asking scientific questions
  • Look at data that’s rich, real, and realistic. Examples: Data packages such as nycflights13 and fivethirtyeight
  • Better motivate traditional statistical topics

2) Computers, Not Math!

Cobb (TAS 2015): Two possible “computational engines” for statistics, in particular relating to sampling:

  • Mathematics: formulas, probability theory, large-sample approximations, central limit theorem

  • Computers: simulations, resampling methods

2) Computers, Not Math!

We present students with a choice for our “engine”:


Either we use this… Or we use this…


  • Almost all are thrilled to do the latter

  • Leave “bread crumbs” for more advanced math/stats courses

2) Computers, Not Math!

What does this buy us?

  • Emphasizes that stats is not math; rather, stats uses math.
  • Simulations are more tactile
  • Reducing the emphasis on probability and the march to the CLT frees up space in the syllabus.

3) Algorithms, Computation, & Coding

  • Both “Data First!” and “Computers, Not Math!” necessitate algorithmic thinking, computational logic, and some coding/programming.
  • Battle is more psychological than anything:
    • “This is not a class on programming!”
    • “Computers are stupid!”
    • “Learning to code is like learning a foreign language!”
    • “Early on don’t code from scratch! Take something else that’s similar and tweak it!”
    • Learning how to Google effectively

3) Algorithms, Computation, & Coding

Why should we do this?

  • Data science and machine learning.
  • Where statistics is heading. Gelman blog post.
  • If we don’t, we are doing a disservice to students by shielding them from these computational ideas.
  • Bigger picture: Coding is becoming a basic skill like reading and writing.

4) Complete Reproducibility

  • Students learn best when they can take apart a toy (analysis) and then rebuild it (synthesis).
  • Crisis in Reproducibility
  • Ultimately the best textbook is one you’ve written yourself.
    • Everyone has different contexts, backgrounds, needs
    • Hard to find one-size-fits-all solutions
  • A new paradigm in textbooks? Versions, not editions?

class: center, middle, inverse

Let’s Dive In!


Baby’s First Bookdown

  • ModernDive Light: Just Data Science Chapters of Bookdown
  • Download this ZIP file (master.zip) & extract the contents to a folder on your computer
  • Double click moderndiver-lite.Rproj to open in RStudio
  • Build -> Build Book
    • install.packages('knitr', repos = c('http://rforge.net', 'http://cran.rstudio.org'), type = 'source')

Getting Started

DataCamp

DataCamp offers an interactive, browser-based tool for learning R and Python. Their two flagship R courses, both of which are free:


DataCamp

Outsource many topics that are essential but not fun to teach, like

  • Idea of command-line vs point-and-click
  • Syntax: Variable names typed exactly, parentheses matching
  • Algorithmic Thinking: Linearity of code, object assignment
  • Computational Logic: boolean algebra, conditional statements, functions
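A minimal sketch in R (not DataCamp’s actual exercises, just the flavor of the ideas above):

x <- 5                      # object assignment; code runs top to bottom
x > 3 & x < 10              # boolean algebra: TRUE
if (x %% 2 == 1) {          # a conditional statement
  print("x is odd")
}
square <- function(y) y^2   # a tiny function
square(x)                   # returns 25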

DataCamp Pros

  • Can assign “Intro to R” on the first day of class as “Homework 0”
  • Outsourcing allows you to free up class time
  • Students get immediate feedback on whether or not their code works
    • Often, the DataCamp error messages are much more useful than the ones R gives

DataCamp Pros

  • With their free academic license, you can
    • Form class “Groups” and assign/track progress of DataCamp courses
    • Have free access to ALL their courses, including ggplot2, dplyr, rmarkdown, and RStudio IDE.
    • Create your own free DataCamp course covering content you want your students to learn using R

DataCamp Cons

  • Some students will still have trouble; however, you can identify who they are.
  • The topics in these two free courses may not align with your syllabus; you can, however, assign at the chapter level instead of the course level.

DataCamp Conclusion

  • Not a tool for “quick retention,” but rather for introducing R concepts and then repeating them.
  • Students need to practice “speaking the language” just like with a foreign language.
  • Feedback from students was positive.
  • Battle is more psychological than anything. DataCamp reinforces to students that
    • “Computers are stupid!”
    • “Learning to code is like learning a foreign language!”

Chester’s First Bookdown Project

Getting used to R, RStudio, and R Markdown

  • Designed to provide students with GIFs to follow along with, plus a description of all the components of RStudio and R Markdown

class: inverse, center, middle

Short break?


Important R ideas for students to know ASAP

Vector/variable
  • Type of vector (int, num, chr, logical, date)

Data frame
  • Vectors of (potentially) different types
  • Each vector has the same number of rows
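A minimal sketch of these ideas in R (illustrative names, not from the text):

# Vectors: every element shares one type
age  <- c(21, 22, 20)           # num
name <- c("Ana", "Bo", "Cy")    # chr
here <- c(TRUE, FALSE, TRUE)    # logical

# A data frame: vectors of (potentially) different types, all the same length
students <- data.frame(name, age, here)
str(students)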


class: center, middle

Welcome to the tidyverse!

The tidyverse is a collection of R packages that share common philosophies and are designed to work together.
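For example, loading the tidyverse attaches the core packages in one step (as of 2017: ggplot2, dplyr, tidyr, readr, purrr, and tibble):

library(tidyverse)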


Chapter 3: Tidy Data?


  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The third point means we don’t mix apples and oranges.


What is Tidy Data?

  1. Each observation forms a row. In other words, each row corresponds to a single instance of an observational unit
  2. Each variable forms a column:
    • Some variables may be used to identify the observational units.
    • For organizational purposes, it’s generally better to put these in the left-hand columns
  3. Each type of observational unit forms a table.

Differentiating between neat data and tidy data

  • Colloquially, they mean the same thing
  • But in our context, one is a subset of the other.


Neat data is
  • easy to look at,
  • organized nicely, and
  • in table form.

Tidy data is neat but also abides by a set of three rules.


class: center, middle

[image]


Is this tidy?


name: demscore

How about this? Is this tidy?

Why is tidy data important? slide


Beginning steps

Frequently the first thing to do when given a dataset is to

  • check that the data is tidy,
  • identify the observational unit,
  • specify the variables, and
  • give the types of variables you are presented with.

This will help with

  • choosing the appropriate plot,
  • summarizing the data, and
  • understanding which inferences can be applied.
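For example, dplyr’s glimpse() makes this first pass quick; here the observational unit of weather is one hour of recordings at one NYC airport:

library(dplyr)
library(nycflights13)
# Lists every variable with its type (int, dbl, chr, ...) and first values
glimpse(weather)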

class: center, middle

Chapter 4: Data Viz

Inspired by Hans Rosling


  • What are the variables here?
  • What is the observational unit?
  • How are the variables mapped to aesthetics?

class: center, middle

Grammar of Graphics

Wilkinson (2005) laid out the proposed
“Grammar of Graphics”



class: center, middle

Grammar of Graphics in R

Wickham implemented the grammar in R
in the ggplot2 package



class: center, middle

What is a statistical graphic?

A mapping of
data variables

to
aes()thetic attributes

of
geom_etric objects.


class: inverse, center, middle

Back to basics


Consider the following data in tidy format:

  • Sketch the graphics below on paper, where the x-axis is variable A and the y-axis is variable B
  1. A scatter plot
  2. A scatter plot where the color of the points corresponds to D
  3. A scatter plot where the size of the points corresponds to C
  4. A line graph
  5. A line graph where the color of the line corresponds to D with points added that are all blue of size 4.
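The code on the following slides assumes a small tidy data frame named simple_ex; a hypothetical reconstruction (the values here are made up):

library(dplyr)
# A and B are numeric positions, C a numeric size, D a category
simple_ex <- data_frame(
  A = c(1980, 1990, 2000, 2010),
  B = c(1, 2, 3, 4),
  C = c(3, 2, 1, 2),
  D = c("hot", "hot", "cold", "cold")
)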

Reproducing the plots in ggplot2

1. A scatterplot

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point()


Reproducing the plots in ggplot2

2. A scatter plot where the color of the points corresponds to D

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))


Reproducing the plots in ggplot2

3. A scatter plot where the size of the points corresponds to C

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()


Reproducing the plots in ggplot2

4. A line graph

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line()


Reproducing the plots in ggplot2

5. A line graph where the color of the line corresponds to D with points added that are all blue of size 4.

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "blue", size = 4)


name: whytidy

Why is tidy data important?

  • Think about trying to plot democracy score across years in the simplest way possible with the data on the Is this tidy? slide.

  • It would be much easier if the data looked like what follows instead so we could put
    • year on the x-axis and
    • dem_score on the y-axis.
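A minimal sketch of that reshaping with 2017-era tidyr, assuming dem_score starts wide with one column per year (this toy version is made up):

library(dplyr)
library(tidyr)
# Wide (untidy): one column per year
dem_score <- data_frame(
  country = c("Australia", "Pakistan"),
  `1952` = c(10, 2),
  `1957` = c(10, 4)
)
# Gather the year columns into rows -> one row per country-year
dem_score_tidy <- dem_score %>%
  gather(key = year, value = dem_score, -country) %>%
  mutate(year = as.integer(year))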

Tidy is good


Let’s plot it

  • Plot the line graph for 4 countries using ggplot
library(dplyr)
library(ggplot2)
dem_score4 <- dem_score_tidy %>%
  filter(country %in% c("Australia", "Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
  geom_line(mapping = aes(color = country))


The Five-Named Graphs

The 5NG of data viz

  • Scatterplot: geom_point()
  • Line graph: geom_line()

  • Histogram: geom_histogram()
  • Boxplot: geom_boxplot()
  • Bar graph: geom_bar()

class: center, middle

More examples


Histogram

library(ggplot2)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
  geom_histogram(bins = 20, color = "black", fill = "darkorange")


Boxplot (broken)

library(ggplot2)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
  geom_boxplot()


Boxplot (fixed)

library(ggplot2)
library(nycflights13)
ggplot(data = weather, mapping = aes(x = factor(month), y = humid)) +
  geom_boxplot()


Bar graph

library(ggplot2)
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar()


How about over time?

  • Hop into dplyr
library(dplyr)
library(fivethirtyeight)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
               "'90-'94", "'95-'99", "'00-'04", "'05-'09",
               "'10-'13")
bechdel <- bechdel %>%
  mutate(five_year = cut(year, 
                         breaks = seq(1969, 2014, 5), 
                         labels = year_bins)) %>% 
  mutate(clean_test = factor(clean_test, 
                             levels = c("nowomen", "notalk", "men",
                                        "dubious", "ok")))

How about over time? (Stacked)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar()


How about over time? (Side-by-side)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "dodge")


How about over time? (Stacked proportional)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black")


class: center, middle

ggplot2 is for beginners and for data science professionals!


Practice

Produce appropriate 5NG with R package & data set in [ ], e.g., [nycflights13 \(\rightarrow\) weather]

  1. Does age predict recline_rude?
    [fivethirtyeight \(\rightarrow\) na.omit(flying)]

  2. Distribution of age by sex
    [okcupiddata \(\rightarrow\) profiles]

  3. Does budget predict rating?
    [ggplot2movies \(\rightarrow\) movies]

  4. Distribution of log base 10 scale of budget_2013
    [fivethirtyeight \(\rightarrow\) bechdel]


HINTS


class: inverse, center, middle

DEMO in RStudio


class: center, middle

Determining the appropriate plot


class: center, middle

Chapter 5: Data Wrangling


gapminder data frame in the gapminder package

library(gapminder)
gapminder

Base R versus the tidyverse

Say we wanted mean life expectancy across all years for Asia

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649

library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

The pipe %>%

   

  • A way to chain together commands

  • It is essentially the dplyr equivalent to the
    + in ggplot2
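For example, these two lines give identical results; the pipe just feeds the left-hand result in as the first argument:

library(dplyr); library(gapminder)
summarize(filter(gapminder, continent == "Asia"), mean_exp = mean(lifeExp))
gapminder %>% filter(continent == "Asia") %>% summarize(mean_exp = mean(lifeExp))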

The 5NG of data viz

geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()


The Five Main Verbs (5MV) of data wrangling

filter()
summarize()
group_by()
mutate()
arrange()


filter()

  • Select a subset of the rows of a data frame.

  • The arguments are the “filters” that you’d like to apply.

library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
  • Use == to compare a variable to a value

Logical operators

  • Use | to check whether any of multiple conditions is true:

gapminder %>% 
  filter(year == 2002 | continent == "Europe")


Logical operators

  • Use & or , to check whether all of multiple conditions are true:

gapminder %>% 
  filter(year == 2002, continent == "Europe")

Logical operators

  • Use %in% to check for any being true
    (shortcut to using | repeatedly with ==)

gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))


summarize()

  • Any numerical summary that you want to apply to a column of a data frame is specified within summarize().
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997


Combining summarize() with group_by()

When you’d like to compute a numerical summary for each level of a categorical variable

max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp),
            sd_exp = sd(lifeExp))
max_exp_1997_by_cont

ggplot2 revisited

For aggregated data, use geom_col()

ggplot(data = max_exp_1997_by_cont, 
       mapping = aes(x = continent, y = max_exp)) +
  geom_col() +
  geom_errorbar(mapping = aes(ymin = max_exp - sd_exp,
                              ymax = max_exp + sd_exp))


The 5MV

  • filter()
  • summarize()
  • group_by()

  • mutate()

  • arrange()

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)

arrange()

  • Reorders the rows in a data frame based on the values of one or more variables

gapminder %>%
  arrange(year, country)

arrange()

  • Can also put into descending order

gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp)) %>%
  head(10)

Don’t mix up arrange and group_by

  • group_by is used (mostly) with summarize to calculate summaries over groups

  • arrange is used for sorting


Don’t mix up arrange and group_by

This doesn’t really do anything useful

gapminder %>% group_by(year)

Don’t mix up arrange and group_by

But this does

gapminder %>% arrange(year)

Changing of observation unit

True or False

Each of filter(), mutate(), and arrange() changes the observational unit.

True or False

group_by() %>% summarize() changes the observational unit.
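One way to see the second claim in action: after group_by() %>% summarize(), each row is a continent rather than a country-year:

library(dplyr); library(gapminder)
# gapminder: one row per country-year
gapminder %>%
  group_by(continent) %>%
  summarize(n_countries = n_distinct(country))
# result: one row per continent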


class: center

What is meant by “joining data frames” and
why is it useful?


Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?

library(fivethirtyeight)
library(dplyr)
library(readr)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police_locals, y = ideology, by = "city")
rmarkdown::paged_table(police_join)

cost_of_living <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/cost_of_living.csv")
police_join_cost <- inner_join(x = police_join, y = cost_of_living, by = "state")
rmarkdown::paged_table(police_join_cost)

Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?

library(ggplot2)
ggplot(data = police_join_cost,
       mapping = aes(x = index, y = all)) +
  geom_point(mapping = aes(color = state_ideology)) +
  labs(x = "Cost of Living Index", y = "% Officers Living in City")


Practice

Use the 5MV to answer problems from R data packages, e.g., [nycflights13 \(\rightarrow\) weather]

  1. What is the maximum arrival delay for each carrier departing JFK? [nycflights13 \(\rightarrow\) flights]

  2. Calculate the domestic return on investment using the 2013 inflation-adjusted figures, arranged in descending order by ROI
    [fivethirtyeight \(\rightarrow\) bechdel]

  3. Include the name of the carrier as a column in the flights data frame
    [nycflights13 \(\rightarrow\) flights, airlines]


class: inverse, center, middle

DEMO in RStudio


class: inverse, center, middle

Statistical Inference


Statistical Inference

We now enter the “traditional” topics of intro stats…

  1. Sampling theory
  2. Hypothesis testing
  3. Confidence intervals
  4. Regression

Statistical Inference

… but now students are armed with

  1. Data visualization
  2. Data wrangling skills
  3. Most important: comfort with coding!

Chapter 6: Sampling Highlights

Sampling is at the root of statistics. Two approaches:


Either we use this… Or we use this…



mosaic Package

It has functions for random simulations:


  1. rflip(): Flip a coin
  2. shuffle(): Shuffle a set of values
  3. do(): Do the same thing many, many, many times
  4. resample(): the Swiss Army knife for sampling
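A quick taste of the first three (a sketch; results vary run to run):

library(mosaic)
rflip(10)                       # flip a fair coin 10 times
shuffle(c("A", "B", "C", "D"))  # permute a set of values
do(1000) * rflip(10)            # repeat the 10 flips 1000 times -> a data frame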

Lady Tasting Tea


Presented to Students As:

  • Say you are a statistician and you meet someone called the “Lady Tasting Tea.”
  • She claims she can tell by tasting whether the tea or the milk was added first to a cup.
  • You want to test whether
    • She can actually tell which came first
    • She’s lying and is really guessing at random
  • Say you have just enough tea/milk to pour into 8 cups.

Lady Tasting Tea

The example will be built around this code: (Available in the supplementary HTML document here.)

library(ggplot2)
library(dplyr)
library(mosaic)

single_cup_guess <- c(1, 0)
simulation <- do(10000) * resample(single_cup_guess, size=8, replace=TRUE)
View(simulation)

simulation <- simulation %>% 
  mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8) 
View(simulation)

ggplot(simulation, aes(x=n_correct)) + 
  geom_bar() +
  labs(x="Number of Guesses Correct", title="Number Correct Assuming She Is Guessing at Random") +
  geom_vline(xintercept=8, col="red")

Chapter 7: Hypothesis Testing Highlights

There is only one test; it has 5 components:

  1. Define \(H_0\) and \(H_A\)
  2. Define the test statistic \(\delta\)
  3. Compute the observed test statistic \(\delta^*\)
  4. Construct the null distribution either
    • Mathematically
    • Via Simulation
  5. Compare \(\delta^*\) to null distribution to compute p-value

There is Only One Test: Lady Tasting Tea

  1. She is guessing at random vs she can tell which came first
  2. Test statistic: Number of the 8 cups she guesses right
  3. Observed test statistic: 8 of 8. The red line!
  4. Null distribution: (simulated) bar graph!
  5. p-value: Very small! About 0.39%, since \(0.5^8 = 1/256\)
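Continuing the simulation code from the previous slide, the simulated p-value is just a proportion:

# Proportion of the 10000 guessing-at-random simulations with all 8 correct
simulation %>%
  summarize(p_value = mean(n_correct == 8))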

There is Only One Test: Goodness-of-Fit

  1. Observations fit expected distribution vs not
  2. Test statistic: \(\sum_{i=1}^{k}\frac{\left(\mbox{Obs}_i-\mbox{Exp}_i\right)^2}{\mbox{Exp}_i}\)
  3. Observed test statistic: Compute using data!
  4. Null Distribution: (mathematically derived)
    Chi-Squared Dist’n.
  5. p-value: Area to the right!

There is Only One Test

[image]

Two-Sample Permutation Test

Posed to students: Did students with an even # of letters in last name do better than those with odd #? (Available in the supplementary HTML document here.)

library(tidyverse)
library(mosaic)
grades <- read_csv("https://raw.githubusercontent.com/ismayc/moderndive-workshops/master/docs/slides/data/grades.csv")
View(grades)

# Observed Difference: Using mosaic package's mean function
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()

null_distribution <- (do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)) %>%
  mutate(difference=odd-even)
View(null_distribution)

# Plot
ggplot(data=null_distribution , aes(x=difference)) +
  geom_histogram(binwidth = 0.025) +
  labs(x="Avg of Odds - Avg of Evens") +
  geom_vline(xintercept = observed_diff, col="red")

Chapter 8: Confidence Intervals Highlights

  • Don’t just show students repeated sampling (with a Shiny app, for example); let them repeatedly sample!
  • In other words, let them construct sampling distributions.

Sampling Distribution, SE, & C.I. Example

The example will be built around this code: (Available in the supplementary HTML document here)

  1. Discuss with your seatmates what all 5 code parts below are doing.
  2. Try increasing n and repeating. What does this correspond to doing in real life?
  3. How does the histogram change?
  4. Describe using statistical language the role n plays when it comes to estimating \(\mu\).



library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)

# For simplicity, remove 3 individuals who did not list their height
profiles <- profiles %>% 
  filter(!is.na(height))

# Population mean
mu <- mean(profiles$height)

# Sample size:
n <- 5

# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))

# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)

# Part 4:
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  labs(x="sample mean") + 
  xlim(c(50,80)) +
  geom_vline(xintercept=mu, col="red")

# Part 5:
sd(samples$mean)
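One hedged extension (not in the original code): treat sd(samples$mean) as the standard error and go out 2 SEs from a single sample’s mean:

# Part 6: a rough 95% confidence interval for mu from one sample of size n
SE <- sd(samples$mean)
one_mean <- mean(resample(profiles$height, size = n, replace = TRUE))
one_mean + c(-2, 2) * SE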

The Hard Part

Convincing students:

  • We only do the blue via simulation as a theoretical exercise
  • We do the purple in real life


[image]

Chapter 9: Regression Highlights

  1. Experience with ggplot2 package and knowledge of the Grammar of Graphics primes students for regression
  2. Use of the broom package to unpack regression

1. ggplot2 Primes Regression

  • Mapping variables to aesthetics provides a natural framework for all of data visualization. The relationships between variables are clear and transparent in the ggplot2 code.
  • This is ultimately what regression is about!

1. ggplot2 Primes Regression

Example:

  • All Alaska Airlines and Frontier flights leaving NYC in 2013
  • We want to study the relationship between temperature and departure delay
  • For summer (June, July, August) and non-summer months separately

Involves four variables:

  • carrier, temp, dep_delay, summer

1. ggplot2 Primes Regression


1. ggplot2 Primes Regression

Why? Dig deeper into data. Look at origin and dest variables as well:



2. broom Package

  • The broom package takes the messy output of built-in modeling functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
  • Fits in with tidyverse ecosystem
  • This works for many R data types!

2. broom Package

In our case, broom functions take lm objects as inputs and return the following in tidy format!

  • tidy(): regression output table
  • augment(): point-by-point values (fitted values, residuals, predicted values)
  • glance(): scalar summaries like \(R^2\)

2. broom Package

The chapter will be built around this code: (Available in the supplementary HTML document here).

library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)

# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
  sample_n(50)
View(alaska_flights)


# Exploratory Data Analysis----------------------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point()

# Correlation coefficient:
alaska_flights %>% 
  summarize(correl = cor(dep_delay, arr_delay))

# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")


# Fit Regression and Study Output with broom Package---------------------------
# Fit regression
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)

# 1. broom::tidy() regression table with confidence intervals and no p-value stars
regression_table <- delay_fit %>% 
  tidy(conf.int=TRUE)
regression_table %>% 
  kable(digits=3)

# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>% 
  augment() %>% 
  select(arr_delay, dep_delay, .fitted, .resid) 
regression_points %>% 
  head() %>% 
  kable(digits=3)

# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>% 
  augment(newdata = new_flights) %>% 
  kable()

# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>% 
  glance() 
regression_summaries %>% 
  kable(digits=3)


# Residual Analysis------------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
  geom_histogram(binwidth=10) +
  geom_vline(xintercept = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq()


# Preview of Multiple Regression-----------------------------------------------
flights_subset <- flights %>% 
  filter(carrier == "AS" | carrier == "F9") %>% 
  left_join(weather, by = c("year", "month", "day", "hour", "origin")) %>% 
  filter(dep_delay < 250) %>% 
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))

ggplot(data = flights_subset, mapping = aes(x = temp, y = dep_delay, col = carrier)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~summer)

class: inverse, center, middle

The Future of ModernDive


The Immediate Future

By July 1st, 2017

  • Complete Development of Chapters 6-9 on Simulations, Hypothesis Testing, Confidence Intervals, and Regression
  • Learning Checks: Discussion/solutions embedded directly in the textbook, that you can reveal progressively.
  • Have better data frame printing. Instead of raw R output, use
    • Less jarring knitr::kable() output or
    • Interactive table outputs, to partially replicate RStudio’s View() function.

The Longer Term

  • In Chapter 9: Regression
    • Add more on categorical predictors
    • Multiple regression

The Longer Term

DataCamp


The Longer Term

Implement Cognitive Science Research


The Longer Term

Further Develop Interactive Applets


class: inverse, center, middle

Introduction to bookdown


What is Markdown?

  • A “plaintext formatting syntax”
  • Type in plain text, render to more complex formats
  • One step beyond writing a txt file
  • Render to HTML, PDF, DOCX, etc. using Pandoc

What does it look like?

.left-column[

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  

]

.right-column[ markdown]


What is R Markdown?

  • “Literate programming”
  • Embed R code in a Markdown document
  • Renders textual output along with graphics

.left-column[


```{r chunk1}
library(ggplot2)
library(dplyr)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```

```{r chunk2}
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```

]

.right-column[

[1] 88

]


What is bookdown?

From bookdown book about bookdown:

Author books with R Markdown, including generating figures and tables, and inserting cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org).
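A minimal sketch of the workflow, assuming a book project with an index.Rmd at its root (as in the ModernDive Light demo):

# Renders the whole book to the gitbook HTML style
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")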


class: inverse, center, middle

DEMO of bookdown
with ModernDive Light


Uploading to GitHub Pages


class: middle

Thanks for attending!