ismayc / @old_man_chester
rudeboybert / @rudeboybert

class: center, middle
Available in the supplementary HTML document here.
tidyverse

Cobb (TAS 2015): Minimize prerequisites to research. In other words, focus on the entirety of Wickham/Grolemund's data science pipeline…
Furthermore, use the data science tools that a data scientist would use. Example: the tidyverse.
What does this buy us?
nycflights13 and fivethirtyeight

Cobb (TAS 2015): Two possible "computational engines" for statistics, in particular relating to sampling:
–
We present students with a choice for our “engine”:
| Either we use this… | Or we use this… |
|---|---|
–
–
What does this buy us?
Why should we do this?
class: center, middle, inverse
Download master.zip
Open moderndiver-lite.Rproj in RStudio
install.packages('knitr', repos = c('http://rforge.net', 'http://cran.rstudio.org'), type = 'source')

DataCamp offers an interactive, browser-based tool for learning R/Python. Their two flagship R courses, both of which are free:
Outsource many essential but not-fun-to-teach topics such as ggplot2, dplyr, rmarkdown, and the RStudio IDE.

Getting used to R, RStudio, and R Markdown
class: inverse, center, middle
Vector/variable
- Type of vector (int, num, chr, logical, date)
–
Data frame
- Vectors of (potentially) different types
- Each vector has the same number of rows
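A tiny illustration of this idea (hypothetical values, not from the slides): a data frame is a bundle of equal-length vectors of different types.

# Hypothetical example: four vectors of different types, same number of rows
students <- data.frame(
  name   = c("Ana", "Ben", "Caro"),   # chr
  age    = c(21L, 23L, 22L),          # int
  gpa    = c(3.6, 3.1, 3.9),          # num
  senior = c(TRUE, FALSE, TRUE)       # logical
)
str(students)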
class: center, middle
The tidyverse is a collection of R packages that share common philosophies and are designed to work together.
The third point means we don’t mix apples and oranges.
Neat data is
- easy to look at,
- organized nicely, and
- in table form.
–
Tidy data is neat but also abides by a set of three rules.
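A small sketch of what abiding by the rules can look like in code (hypothetical data, not from the slides): reshaping a wide table so that each variable forms a column and each observation forms a row.

library(tidyr)
# Hypothetical wide table: one column per year
scores_wide <- data.frame(
  country = c("Australia", "Pakistan"),
  `1952` = c(10, -7),
  `1962` = c(10, -9),
  check.names = FALSE
)
# gather() stacks the year columns into a single year/dem_score pair of columns
scores_tidy <- gather(scores_wide, key = "year", value = "dem_score", -country)
scores_tidy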
class: center, middle
name: demscore
Frequently the first thing to do when given a dataset is to
This will help with
class: center, middle
Inspired by Hans Rosling
class: center, middle
Wilkinson (2005) laid out the proposed
“Grammar of Graphics”
class: center, middle
Wickham implemented the grammar in R
in the ggplot2 package
class: center, middle
–
mapping of data variables
–
to aes()thetic attributes
–
of geom_etric objects.

class: inverse, center, middle
- x-axis is variable A and the y-axis is variable B
- color of the points corresponds to D
- size of the points corresponds to C
- color of the line corresponds to D with points added that are all green of size 4

ggplot2

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
  geom_point()
–
ggplot2: color of the points corresponds to D

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
  geom_point(mapping = aes(color = D))
–
ggplot2: size of the points corresponds to C

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) +
  geom_point()
–
ggplot2: a line instead of points

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
  geom_line()
–
ggplot2: color of the line corresponds to D with points added that are all blue of size 4

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "blue", size = 4)
–
name: whytidy
Plot year on the x-axis and dem_score on the y-axis with ggplot:

dem_score4 <- dem_score_tidy %>%
  filter(country %in% c("Australia", "Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
  geom_line(mapping = aes(color = country))

The five named graphs (5NG):
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()

class: center, middle
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
  geom_histogram(bins = 20, color = "black", fill = "darkorange")

library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
  geom_boxplot()

library(nycflights13)
ggplot(data = weather, mapping = aes(x = factor(month), y = humid)) +
  geom_boxplot()

library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar()

dplyr

library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
"'90-'94", "'95-'99", "'00-'04", "'05-'09",
"'10-'13")
bechdel <- bechdel %>%
mutate(five_year = cut(year,
breaks = seq(1969, 2014, 5),
labels = year_bins)) %>%
mutate(clean_test = factor(clean_test,
                            levels = c("nowomen", "notalk", "men",
                                       "dubious", "ok")))

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar()

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "dodge")

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black")

class: center, middle
Produce appropriate 5NG with R package & data set in [ ], e.g., [nycflights13 \(\rightarrow\) weather]
Does age predict recline_rude?
[fivethirtyeight \(\rightarrow\) na.omit(flying)]
Distribution of age by sex
[okcupiddata \(\rightarrow\) profiles]
Does budget predict rating?
[ggplot2movies \(\rightarrow\) movies]
Distribution of budget_2013 on a log base 10 scale
[fivethirtyeight \(\rightarrow\) bechdel]
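One possible take on the last exercise above (a sketch, not the authors' solution): plot the distribution of budget_2013 with the x-axis on a log base 10 scale.

library(fivethirtyeight)
library(ggplot2)
# Histogram of 2013-adjusted budgets, x-axis on a log10 scale
ggplot(data = bechdel, mapping = aes(x = budget_2013)) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_log10()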
class: inverse, center, middle
The gapminder data frame in the gapminder package

library(gapminder)

gapminder in base R versus the tidyverse:

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
–
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

The %>% is the dplyr equivalent to the + in ggplot2:
- ggplot2 layers: geom_point(), geom_line(), geom_histogram(), geom_boxplot(), geom_bar()
- dplyr verbs: filter(), summarize(), group_by(), mutate(), arrange()

filter(): Select a subset of the rows of a data frame.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)

Use == to compare a variable to a value.
Use | to check for any of multiple filters being true:

gapminder %>%
  filter(year == 2002 | continent == "Europe")
–
Use & or , to check for all of multiple filters being true:

gapminder %>%
  filter(year == 2002, continent == "Europe")

Use %in% to check for any being true (a shortcut to using | repeatedly with ==):

gapminder %>%
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
–
summarize(): a numerical summary to be applied to a column of a data frame is specified within summarize().

max_exp_1997 <- gapminder %>%
filter(year == 1997) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997
–
summarize() with group_by(): when you'd like to determine a numerical summary for all levels of a different categorical variable
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp),
sd_exp = sd(lifeExp))
max_exp_1997_by_cont

ggplot2 revisited: for aggregated data, use geom_col()
ggplot(data = max_exp_1997_by_cont,
mapping = aes(x = continent, y = max_exp)) +
geom_col() +
geom_errorbar(mapping = aes(ymin = max_exp - sd_exp,
                                ymax = max_exp + sd_exp))

filter()
summarize()
group_by()
–
mutate()
–
arrange()

mutate()
–
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)

mutate()
–
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)

mutate()
–
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)

arrange()

gapminder %>%
  arrange(year, country)

arrange()

gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp)) %>%
  head(10)

arrange and group_by
group_by is used (mostly) with summarize to calculate summaries over groups.
arrange is used for sorting.
arrange and group_by
This doesn't really do anything useful:
gapminder %>% group_by(year)

arrange and group_by
But this does:
gapminder %>% arrange(year)

True or False
Each of filter, mutate, and arrange changes the observational unit.
–
True or False
group_by() %>% summarize() changes the observational unit.
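A small illustration of the distinction (a sketch, not from the slides): filter() keeps the country-year observational unit, while group_by() %>% summarize() collapses it.

library(gapminder)
library(dplyr)
# Still one row per country-year after filtering
gapminder %>% filter(year == 2007) %>% head(3)
# Now one row per continent: the observational unit has changed
gapminder %>% group_by(continent) %>%
  summarize(mean_exp = mean(lifeExp)) %>% head(3)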
class: center
–
library(fivethirtyeight)
library(readr)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police_locals, y = ideology, by = "city")
rmarkdown::paged_table(police_join)

cost_of_living <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/cost_of_living.csv")
police_join_cost <- inner_join(x = police_join, y = cost_of_living, by = "state")
rmarkdown::paged_table(police_join_cost)

ggplot(data = police_join_cost,
mapping = aes(x = index, y = all)) +
geom_point(aes(color = state_ideology)) +
  labs(x = "Cost of Living Index", y = "% Officers Living in City")

Use the 5MV to answer problems from R data packages, e.g., [nycflights13 \(\rightarrow\) weather]
What is the maximum arrival delay for each carrier departing JFK? [nycflights13 \(\rightarrow\) flights]
Calculate the domestic return on investment for the 2013-scaled data, sorted in descending order of ROI
[fivethirtyeight \(\rightarrow\) bechdel]
Include the name of the carrier as a column in the flights data frame
[nycflights13 \(\rightarrow\) flights, airlines]
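One possible approach to two of the exercises above, the maximum arrival delay by carrier and adding the carrier name (a sketch, not the authors' solutions):

library(nycflights13)
library(dplyr)
# Maximum arrival delay for each carrier departing JFK
flights %>%
  filter(origin == "JFK") %>%
  group_by(carrier) %>%
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))
# Add the carrier name as a column by joining flights to airlines
flights_named <- flights %>%
  inner_join(airlines, by = "carrier")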
class: inverse, center, middle
class: inverse, center, middle
We now enter the “traditional” topics of intro stats…
… but now students are armed with
Sampling is at the root of statistics. Two approaches:
| Either we use this… | Or we use this… |
|---|---|
The mosaic Package

It has functions for random simulations:
rflip(): Flip a coin
shuffle(): Shuffle a set of values
do(): Do the same thing many, many, many times
resample(): the Swiss army knife for sampling

The example will be built around this code: (Available in the supplementary HTML document here.)
library(ggplot2)
library(dplyr)
library(mosaic)
single_cup_guess <- c(1, 0)
simulation <- do(10000) * resample(single_cup_guess, size=8, replace=TRUE)
View(simulation)
simulation <- simulation %>%
mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8)
View(simulation)
ggplot(simulation, aes(x=n_correct)) +
geom_bar() +
labs(x="Number of Guesses Correct", title="Number Correct Assuming She Is Guessing at Random") +
  geom_vline(xintercept=8, col="red")
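A possible follow-up to the simulation (a sketch; an assumption, not part of the original code): estimate how often pure guessing gets all 8 cups right.

# Proportion of the 10,000 simulated guessers who got all 8 correct
simulation %>%
  summarize(p_value = mean(n_correct >= 8))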
There is only one test; it has 5 components:

Posed to students: Did students with an even number of letters in their last name do better than those with an odd number? (Available in the supplementary HTML document here.)
library(tidyverse)
library(mosaic)
grades <- read_csv("https://raw.githubusercontent.com/ismayc/moderndive-workshops/master/docs/slides/data/grades.csv")
View(grades)
# Observed Difference: Using mosaic package's mean function
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
null_distribution <- (do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)) %>%
mutate(difference=odd-even)
View(null_distribution)
# Plot
ggplot(data=null_distribution , aes(x=difference)) +
geom_histogram(binwidth = 0.025) +
labs(x="Avg of Odds - Avg of Evens") +
  geom_vline(xintercept = observed_diff, col="red")
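A possible follow-up to the permutation test (a sketch; an assumption, not part of the original code): turn the null distribution into an approximate two-sided p-value.

# Proportion of shuffled differences at least as extreme as the observed one
null_distribution %>%
  summarize(p_value = mean(abs(difference) >= abs(observed_diff)))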
Take a resample of size n and repeat. What does this correspond to doing in real life?
The role that the sample size n plays when it comes to estimating \(\mu\).

The example will be built around this code: (Available in the supplementary HTML document here.)

library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
# For simplicity, remove 3 individuals who did not list their height
profiles <- profiles %>%
filter(!is.na(height))
# Population mean
mu <- mean(profiles$height)
# Sample size:
n <- 5
# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))
# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)
# Part 4:
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="sample mean") +
xlim(c(50,80)) +
geom_vline(xintercept=mu, col="red")
# Part 5:
sd(samples$mean)

Convincing students:
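One concrete way to make the case (a sketch; an assumption, not part of the original code): rerun Part 3 with a larger sample size and compare the spread of the two sampling distributions.

# Same simulation as Part 3 but with n = 50 instead of n = 5
samples_50 <- do(10000) * mean(resample(profiles$height, size=50, replace=TRUE))
sd(samples$mean)     # spread of sample means when n = 5
sd(samples_50$mean)  # noticeably smaller spread when n = 50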
The ggplot2 package and knowledge of the Grammar of Graphics prime students for regression.
The broom package helps unpack regression output.

ggplot2 Primes Regression

These plots are just ggplot2 code.

ggplot2 Primes Regression

Example:
Involves four variables:
carrier, temp, dep_delay, summer

ggplot2 Primes Regression

ggplot2 Primes Regression

Why? Dig deeper into the data. Look at the origin and dest variables as well:
The broom Package

The broom package takes the messy output of built-in modeling functions in R, such as lm, nls, or t.test, and turns them into tidy data frames. It fits into the tidyverse ecosystem.

The broom Package

In our case, broom functions take lm objects as inputs and return the following in tidy format!
tidy(): regression output table
augment(): point-by-point values (fitted values, residuals, predicted values)
glance(): scalar summaries like \(R^2\)

The broom Package

The chapter will be built around this code: (Available in the supplementary HTML document here.)
library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
sample_n(50)
View(alaska_flights)
# Exploratory Data Analysis----------------------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
# Correlation coefficient:
alaska_flights %>%
summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
# Fit Regression and Study Output with broom Package---------------------------
# Fit regression
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
# 1. broom::tidy() regression table with confidence intervals and no p-value stars
regression_table <- delay_fit %>%
tidy(conf.int=TRUE)
regression_table %>%
kable(digits=3)
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>%
augment() %>%
select(arr_delay, dep_delay, .fitted, .resid)
regression_points %>%
head() %>%
kable(digits=3)
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>%
augment(newdata = new_flights) %>%
kable()
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>%
glance()
regression_summaries %>%
kable(digits=3)
# Residual Analysis------------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
geom_histogram(binwidth=10) +
geom_vline(xintercept = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
stat_qq()
# Preview of Multiple Regression-----------------------------------------------
flights_subset <- flights %>%
filter(carrier == "AS" | carrier == "F9") %>%
left_join(weather, by = c("year", "month", "day", "hour", "origin")) %>%
filter(dep_delay < 250) %>%
mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
ggplot(data = flights_subset, mapping = aes(x = temp, y = dep_delay, col = carrier)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~summer)
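A possible next step (a sketch; an assumption, not part of the original code): fit the multiple regression that the faceted plot suggests and tidy its output with broom.

# Model dep_delay using temperature, carrier, and the summer indicator
multi_fit <- lm(dep_delay ~ temp + carrier + summer, data = flights_subset)
multi_fit %>%
  tidy() %>%
  kable(digits = 3)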
class: inverse, center, middle

By July 1st, 2017
knitr::kable() output or the View() function
The tutorial package

class: inverse, center, middle
bookdown

Markdown: just a txt file.

.left-column[
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And Tables
---- -------
Like This
]
.right-column[ ]
.left-column[
```{r chunk1}
library(ggplot2)
library(dplyr)
library(nycflights13)
pdx_flights <- flights %>%
filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```
```{r chunk2}
ggplot(data = pdx_flights,
mapping = aes(x = arr_delay,
y = dep_delay)) +
geom_point()
```
]
.right-column[
[1] 88
]
Why bookdown?

From the bookdown book about bookdown:
Author books with R Markdown, including generating figures and tables, and inserting cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org).
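A minimal sketch of what building a book looks like (assumed conventional file name, not from the slides):

# Render all chapters of a bookdown project, starting from index.Rmd,
# into the HTML gitbook format
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")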
class: inverse, center, middle
bookdown

class: middle