ismayc
@old_man_chester
rudeboybert
@rudeboybert
class: center, middle
Available in the supplementary HTML document here.
tidyverse
Cobb (TAS 2015): Minimize prerequisites to research. In other words, focus on the entirety of Wickham/Grolemund’s pipeline…
Furthermore, use the data science tools that a data scientist would use. Example: the tidyverse.
What does this buy us?
nycflights13 and fivethirtyeight
Cobb (TAS 2015): Two possible “computational engines” for statistics, in particular relating to sampling:
–
We present students with a choice for our “engine”:
Either we use this… | Or we use this…
---|---
–
–
What does this buy us?
Why should we do this?
class: center, middle, inverse
Download `master.zip`, then open `moderndiver-lite.Rproj` in RStudio.

install.packages('knitr', repos = c('http://rforge.net', 'http://cran.rstudio.org'), type = 'source')
DataCamp offers an interactive, browser based tool for learning R/Python. Their two flagship R courses, both of which are free:
Outsource many essential but not-fun-to-teach topics like `ggplot2`, `dplyr`, `rmarkdown`, and the RStudio IDE.

Getting used to R, RStudio, and R Markdown
class: inverse, center, middle
Vector/variable - Type of vector (`int`, `num`, `chr`, `logical`, `date`)
–
Data frame - Vectors of (potentially) different types - Each vector has the same length (number of rows)
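This definition can be illustrated directly in base R (the vectors here are made up for illustration):

```r
# Three vectors of (potentially) different types, all the same length
name   <- c("Ann", "Ben", "Cat")     # chr
height <- c(62.5, 70.1, 66.0)        # num
vegan  <- c(TRUE, FALSE, FALSE)      # logical

# A data frame binds them column-wise: one row per entry
people <- data.frame(name, height, vegan, stringsAsFactors = FALSE)
str(people)  # each column keeps its own type
```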
class: center, middle
The tidyverse is a collection of R packages that share common philosophies and are designed to work together.
The third point means we don’t mix apples and oranges.
Neat data is
- easy to look at,
- organized nicely, and
- in table form.
–
Tidy data is neat but also abides by a set of three rules.
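A minimal sketch of moving from neat to tidy, assuming `tidyr`'s `gather()` (part of the tidyverse) and a made-up wide table:

```r
library(tidyr)

# A 'neat' but non-tidy table: the year variable is spread across columns
dem_wide <- data.frame(country = c("Australia", "Pakistan"),
                       `1992` = c(10, 8),
                       `1997` = c(10, 7),
                       check.names = FALSE)

# gather() moves it to tidy form: one variable per column,
# one observation (a country-year pair) per row
dem_tidy <- gather(dem_wide, key = "year", value = "dem_score", -country)
dem_tidy
```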
class: center, middle
name: demscore
Frequently the first thing to do when given a dataset is to
This will help with
class: center, middle
Inspired by Hans Rosling
class: center, middle
Wilkinson (2005) laid out the proposed
“Grammar of Graphics”
class: center, middle
Wickham implemented the grammar in R
in the ggplot2
package
class: center, middle
–
`mapping` of data variables
–
to `aes()`thetic attributes
–
of `geom_`etric objects.

class: inverse, center, middle
- The `x`-axis is variable `A` and the `y`-axis is variable `B`
- The `color` of the points corresponds to `D`
- The `size` of the points corresponds to `C`
- The `color` of the line corresponds to `D`, with points added that are all green of size 4

ggplot2
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point()
–
ggplot2

The `color` of the points corresponds to `D`:
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_point(mapping = aes(color = D))
–
ggplot2

The `size` of the points corresponds to `C`:
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) +
geom_point()
–
ggplot2
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line()
–
ggplot2

The `color` of the line corresponds to `D`, with points added that are all blue of size 4:

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
geom_line(mapping = aes(color = D)) +
geom_point(color = "blue", size = 4)
–
name: whytidy
Plot `year` on the `x`-axis and `dem_score` on the `y`-axis using `ggplot`:
dem_score4 <- dem_score_tidy %>%
filter(country %in% c("Australia", "Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
geom_line(mapping = aes(color = country))
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()
class: center, middle
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
geom_histogram(bins = 20, color = "black", fill = "darkorange")
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
geom_boxplot()
library(nycflights13)
ggplot(data = weather, mapping = aes(x = factor(month), y = humid)) +
geom_boxplot()
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
geom_bar()
dplyr
library(fivethirtyeight)
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
"'90-'94", "'95-'99", "'00-'04", "'05-'09",
"'10-'13")
bechdel <- bechdel %>%
mutate(five_year = cut(year,
breaks = seq(1969, 2014, 5),
labels = year_bins)) %>%
mutate(clean_test = factor(clean_test,
levels = c("nowomen", "notalk", "men",
"dubious", "ok")))
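The `cut()` call above bins `year` into five-year intervals; a minimal standalone sketch of how `cut()` assigns labels (values here are made up):

```r
# Intervals are open on the left, closed on the right by default:
# 1970 and 1974 both fall in (1969, 1974]
years  <- c(1970, 1974, 1975, 2013)
binned <- cut(years,
              breaks = c(1969, 1974, 1979, 2014),
              labels = c("'70-'74", "'75-'79", "'80-'13"))
as.character(binned)  # "'70-'74" "'70-'74" "'75-'79" "'80-'13"
```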
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar()
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "dodge")
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
mapping = aes(x = five_year, fill = clean_test)) +
geom_bar(position = "fill", color = "black")
class: center, middle
Produce the appropriate 5NG with the R package & data set in [ ], e.g., [`nycflights13` \(\rightarrow\) `weather`]

- Does `age` predict `recline_rude`? [`fivethirtyeight` \(\rightarrow\) `na.omit(flying)`]
- Distribution of `age` by `sex` [`okcupiddata` \(\rightarrow\) `profiles`]
- Does `budget` predict `rating`? [`ggplot2movies` \(\rightarrow\) `movies`]
- Distribution of `budget_2013` on a log base 10 scale [`fivethirtyeight` \(\rightarrow\) `bechdel`]
class: inverse, center, middle
The `gapminder` data frame in the `gapminder` package:

library(gapminder)
gapminder
tidyverse
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
–
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
summarize(mean_exp = mean(lifeExp))
The `%>%` (pipe) operator in `dplyr` is equivalent to the `+` in `ggplot2`.
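Concretely, `x %>% f(y)` is `f(x, y)`: the pipe feeds the left-hand side in as the first argument, so nested calls become a top-to-bottom chain. A sketch using the built-in `mtcars` data:

```r
library(dplyr)

# Nested form: read inside-out
nested <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# Piped form: read top-to-bottom, one verb per line
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))

all.equal(nested, piped)  # same one-row result either way
```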
geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()
filter()
summarize()
group_by()
mutate()
arrange()
filter()
Select a subset of the rows of a data frame.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
Use `==` to compare a variable to a value.

Use `|` to check for any of multiple filters being true:

gapminder %>%
  filter(year == 2002 | continent == "Europe")
–
Use `&` or `,` to check for all of multiple filters being true:

gapminder %>%
  filter(year == 2002, continent == "Europe")
Use `%in%` to check for any of several values being true (instead of using `|` repeatedly with `==`):

gapminder %>%
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
–
summarize()

max_exp_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997
–
`summarize()` with `group_by()`

When you’d like to determine a numerical summary for all levels of a categorical variable:
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp),
sd_exp = sd(lifeExp))
max_exp_1997_by_cont
ggplot2 revisited

For aggregated data, use `geom_col()`:
ggplot(data = max_exp_1997_by_cont,
mapping = aes(x = continent, y = max_exp)) +
geom_col() +
geom_errorbar(mapping = aes(ymin = max_exp - sd_exp,
ymax = max_exp + sd_exp))
filter()
summarize()
group_by()
–
mutate()
–
arrange()
mutate()
–
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)
mutate()
–
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)
mutate()
–
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)
arrange()
gapminder %>%
arrange(year, country)
arrange()
gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp)) %>%
head(10)
arrange() and group_by()

- `group_by()` is used (mostly) with `summarize()` to calculate summaries over groups
- `arrange()` is used for sorting

arrange() and group_by()
This doesn’t really do anything useful
gapminder %>% group_by(year)
arrange
and group_by
But this does
gapminder %>% arrange(year)
True or False

Each of `filter()`, `mutate()`, and `arrange()` changes the observational unit.
–
True or False
group_by() %>% summarize()
changes the observational unit.
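Both claims can be checked by counting rows, since changing the observational unit changes what one row represents. A sketch with the built-in `mtcars` data:

```r
library(dplyr)

nrow(mtcars)  # 32 rows: the observational unit is a single car

# filter/mutate/arrange keep one row per car (filter may drop some,
# but each remaining row is still a car)
mtcars %>% mutate(kpl = 0.425 * mpg) %>% nrow()  # still 32

# group_by() %>% summarize() collapses to one row per group:
# the observational unit becomes a cylinder class
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
nrow(by_cyl)  # 3 rows: cyl = 4, 6, 8
```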
class: center
–
library(fivethirtyeight)
library(readr)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police_locals, y = ideology, by = "city")
rmarkdown::paged_table(police_join)
cost_of_living <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/cost_of_living.csv")
police_join_cost <- inner_join(x = police_join, y = cost_of_living, by = "state")
rmarkdown::paged_table(police_join_cost)
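The `inner_join()` calls above match rows on the `by` column and keep only keys present in both tables; a toy sketch with made-up data frames:

```r
library(dplyr)

cities <- data.frame(city  = c("Boise", "Reno", "Provo"),
                     pop_k = c(236, 264, 115))
states <- data.frame(city  = c("Boise", "Reno"),
                     state = c("ID", "NV"))

# "Provo" has no match in 'states', so it is dropped from the result
joined <- inner_join(x = cities, y = states, by = "city")
joined
```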
ggplot(data = police_join_cost,
mapping = aes(x = index, y = all)) +
geom_point(aes(color = state_ideology)) +
labs(x = "Cost of Living Index", y = "% Officers Living in City")
Use the 5MV to answer problems from R data packages, e.g., [`nycflights13` \(\rightarrow\) `weather`]

- What is the maximum arrival delay for each carrier departing JFK? [`nycflights13` \(\rightarrow\) `flights`]
- Calculate the domestic return on investment for 2013-scaled data, descending by ROI [`fivethirtyeight` \(\rightarrow\) `bechdel`]
- Include the name of the carrier as a column in the `flights` data frame [`nycflights13` \(\rightarrow\) `flights`, `airlines`]
class: inverse, center, middle
class: inverse, center, middle
We now enter the “traditional” topics of intro stats…
… but now students are armed with
Sampling is at the root of statistics. Two approaches:
Either we use this… | Or we use this…
---|---
The mosaic Package

It has functions for random simulations:

- `rflip()`: Flip a coin
- `shuffle()`: Shuffle a set of values
- `do()`: Do the same thing many, many, many times
- `resample()`: the Swiss Army knife for sampling

The example will be built around this code (available in the supplementary HTML document here):
library(ggplot2)
library(dplyr)
library(mosaic)
single_cup_guess <- c(1, 0)
simulation <- do(10000) * resample(single_cup_guess, size=8, replace=TRUE)
View(simulation)
simulation <- simulation %>%
mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8)
View(simulation)
ggplot(simulation, aes(x=n_correct)) +
geom_bar() +
labs(x="Number of Guesses Correct", title="Number Correct Assuming She Is Guessing at Random") +
geom_vline(xintercept=8, col="red")
There is only one test; it has 5 components:
Posed to students: Did students with an even number of letters in their last name do better than those with an odd number? (Available in the supplementary HTML document here.)
library(tidyverse)
library(mosaic)
grades <- read_csv("https://raw.githubusercontent.com/ismayc/moderndive-workshops/master/docs/slides/data/grades.csv")
View(grades)
# Observed Difference: Using mosaic package's mean function
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
null_distribution <- (do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)) %>%
mutate(difference=odd-even)
View(null_distribution)
# Plot
ggplot(data=null_distribution , aes(x=difference)) +
geom_histogram(binwidth = 0.025) +
labs(x="Avg of Odds - Avg of Evens") +
geom_vline(xintercept = observed_diff, col="red")
The example will be built around this code: (Available in the supplementary HTML document here)
Try changing `n` and repeating. What does this correspond to doing in real life? Note the role `n` plays when it comes to estimating \(\mu\).

library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
# For simplicity, remove 3 individuals who did not list their height
profiles <- profiles %>%
filter(!is.na(height))
# Population mean
mu <- mean(profiles$height)
# Sample size:
n <- 5
# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))
# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)
# Part 4:
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="sample mean") +
xlim(c(50,80)) +
geom_vline(xintercept=mu, col="red")
# Part 5:
sd(samples$mean)
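Part 5's `sd(samples$mean)` is the simulation-based standard error; it should land near the theoretical \(\sigma/\sqrt{n}\). A base-R-only check (no `mosaic`), using a hypothetical normal population:

```r
set.seed(76)

# Hypothetical population of 10,000 heights (inches)
population <- rnorm(10000, mean = 68, sd = 4)
n <- 5

# Sampling distribution of the mean: 10,000 resamples of size n
sample_means <- replicate(10000,
                          mean(sample(population, size = n, replace = TRUE)))

# Simulated standard error vs. theoretical sigma / sqrt(n)
sd(sample_means)
sd(population) / sqrt(n)
```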
Convincing students:

- The `ggplot2` package and knowledge of the Grammar of Graphics primes students for regression
- The `broom` package helps unpack regression

ggplot2 Primes Regression

Example:
Involves four variables: `carrier`, `temp`, `dep_delay`, and `summer`

ggplot2 Primes Regression

Why? Dig deeper into the data. Look at the `origin` and `dest` variables as well:
The broom Package

The `broom` package takes the messy output of built-in modeling functions in R, such as `lm`, `nls`, or `t.test`, and turns them into tidy data frames that fit into the `tidyverse` ecosystem.

In our case, `broom` functions take `lm` objects as inputs and return the following in tidy format:

- `tidy()`: regression output table
- `augment()`: point-by-point values (fitted values, residuals, predicted values)
- `glance()`: scalar summaries like \(R^2\)

The chapter will be built around this code (available in the supplementary HTML document here).
library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
sample_n(50)
View(alaska_flights)
# Exploratory Data Analysis----------------------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
# Correlation coefficient:
alaska_flights %>%
summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
# Fit Regression and Study Output with broom Package---------------------------
# Fit regression
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
# 1. broom::tidy() regression table with confidence intervals and no p-value stars
regression_table <- delay_fit %>%
tidy(conf.int=TRUE)
regression_table %>%
kable(digits=3)
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>%
augment() %>%
select(arr_delay, dep_delay, .fitted, .resid)
regression_points %>%
head() %>%
kable(digits=3)
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>%
augment(newdata = new_flights) %>%
kable()
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>%
glance()
regression_summaries %>%
kable(digits=3)
# Residual Analysis------------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
geom_histogram(binwidth=10) +
geom_vline(xintercept = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
stat_qq()
# Preview of Multiple Regression-----------------------------------------------
flights_subset <- flights %>%
filter(carrier == "AS" | carrier == "F9") %>%
left_join(weather, by = c("year", "month", "day", "hour", "origin")) %>%
filter(dep_delay < 250) %>%
mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
ggplot(data = flights_subset, mapping = aes(x = temp, y = dep_delay, col = carrier)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~summer)
class: inverse, center, middle
By July 1st, 2017:

- `knitr::kable()` output or the `View()` function
- the `tutorial` package

class: inverse, center, middle
bookdown
txt
file.

.left-column[
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And Tables
---- -------
Like This
]
.right-column[ ]
.left-column[
```{r chunk1}
library(dplyr)
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>%
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```
```{r chunk2}
ggplot(data = pdx_flights,
mapping = aes(x = arr_delay,
y = dep_delay)) +
geom_point()
```
]
.right-column[
[1] 88
]
Why bookdown?

From the bookdown book about `bookdown`:
Author books with R Markdown, including generating figures and tables, and inserting cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org).
class: inverse, center, middle
bookdown
class: middle