# A Fully Customizable Textbook for Introductory Statistics/Data Science
## USCOTS 2017 Workshop
### Chester Ismay and Albert Y. Kim
### May 17 & 18, 2017

Slides at <http://bit.ly/uscots17-slides>

Supplementary HTML document at <http://bit.ly/uscots17-html>

---

# Introduction

## Who We Are

* [Chester Ismay](https://ismayc.github.io/): Reed College & Pacific University
 + Email: <chester.ismay@gmail.com>
 + GitHub: [`ismayc`](https://github.com/ismayc)
 + Twitter: [`@old_man_chester`](https://twitter.com/old_man_chester)
* [Albert Y. Kim](http://rudeboybert.github.io/): Middlebury College
 + Email: <albert.ys.kim@gmail.com>
 + GitHub: [`rudeboybert`](https://github.com/rudeboybert)
 + Twitter: [`@rudeboybert`](https://twitter.com/rudeboybert)

---

## Outline of Workshop

[Google Doc at <http://bit.ly/uscots17-agenda>](https://docs.google.com/document/d/12Ai7wxK5OTrIwwrSJQewXHcqpMJ09lB-HkGN3BbShy4/edit?usp=sharing)

---

## Our Textbook

* *An Introduction to Statistical and Data Sciences via R*
* Webpage: <http://moderndive.com>. [GitHub Repo](https://github.com/ismayc/moderndiver-book)
* If you haven't already, please [sign up](http://moderndive.us15.list-manage2.com/subscribe?u=87888fab720da90906427a5be&id=0c9e2d1df2) for our mailing list!

---

## Albert's Course (Intro to Statistical & Data Sciences)

Available in the supplementary HTML document [here](https://ismayc.github.io/moderndive-workshops/slides/slide_document.html#introduction).

* [Webpage](https://rudeboybert.github.io/MATH116/) and [GitHub Repo](https://github.com/rudeboybert/MATH116)
* Administrative:
 + Chief non-econ/bio stats service class at Middlebury
 + 12 weeks each with 3h "lecture" + 1h "lab"
 + No prerequisites
* Students:
 + ~24 students/section of all years/backgrounds. Only stats class many will take
 + Background: Many had AP stats, some with programming
 + All had laptops that they brought every day
* [Topic List](https://rudeboybert.github.io/MATH116/)
 + First half is data science: data visualization, manipulation, importing
 + Second half is intro stats: sampling, hypothesis tests, CI, regression
* Evaluation
 + 10%: weekly problem sets
 + 10%: engagement
 + 45%: 3 midterms (last during finals week)
 + 35%: [Final projects](https://rudeboybert.github.io/MATH116/PS/final_project/final_project_outline.html#learning_goals)
* Typical Classtime:
 + First 10-15min: Priming topic, either via slides or chalk talk
 + Remainder: Students read over text & do Learning Checks in groups and without direct instructor guidance.

---

## Chester's Course (Social Statistics)

Available in the supplementary HTML document [here](https://ismayc.github.io/moderndive-workshops/slides/slide_document.html#introduction)

* [Webpage at <http://bit.ly/soc-301>](https://ismayc.github.io/soc301_s2017/) and [GitHub Repo](https://github.com/ismayc/soc301_s2017)
* Administrative:
 + Chief stats service class for sociology/criminal justice
 + An option to fulfill the Pacific U. math requirement
 + 14 weeks, meeting on Tues & Thurs for 95 minutes
 + No prerequisites
* Students:
 + 26 students of all years/backgrounds. Only stats class many will take
 + Background: 3 had AP stats, zero with programming
 + All had laptops that they brought every day
* [Course Schedule](https://ismayc.github.io/soc301_s2017/schedule/)
 + First half is data science: data visualization, wrangling, importing
 + Second half is intro stats: sampling, testing, CI
* [Evaluation](https://ismayc.github.io/soc301_s2017/syllabus/)
 + 5%: Engagement/Pass-fail Learning Checks
 + 10%: DataCamp/article summarizing assignments
 + 15%: [Group Project](https://ismayc.github.io/soc301_s2017/group-projects/index.html) 
 + 20%: Pencil-and-paper Midterm Exam
 + 25%: (5) Multiple choice cumulative quizzes
 + 25%: Cumulative Pencil-and-paper Final Exam
* Typical Classtime:
 + First 5-10min: Students answer warmup exercise based on previous content
 + Next 10-20min: Review reading assignment via [slides](http://ismayc.github.io/soc301_s2017/slides/slide_deck.html)
 + Bulk of class:
     - Students read over text & do Learning Checks in groups and without direct instructor guidance.
     - Students work on next DataCamp problems and ask questions as needed
 + Last 5-10min: Go over warmup exercise again or quiz students on material from that period

---

## What Are We Doing And Why?

1. Data first! Start with data science via `tidyverse`, then stats builds on these ideas.
1. Replacing the mathematical/analytic with computational/simulation-based whenever possible.
1. The above necessitates algorithmic thinking, computational logic and some coding/programming.
1. Complete reproducibility

---

## 1) Data First!

Cobb ([TAS 2015](https://arxiv.org/abs/1507.05346)): *Minimizing prerequisites to research*. In other words, focus on the entirety of Wickham/Grolemund's pipeline...

![](figure/pipeline.png)

---

## 1) Data First!

Furthermore, use the data science tools that a data scientist would use. Example: [`tidyverse`](http://tidyverse.org/)

---

## 1) Data First!

What does this buy us?

* Students can do effective data storytelling
* Context for asking scientific questions
* Look at data that's rich, real, and realistic. Examples: Data packages such as [`nycflights13`](https://github.com/hadley/nycflights13) and [`fivethirtyeight`](https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html)
* Better motivate traditional statistical topics

---

## 2) Computers, Not Math!

Cobb ([TAS 2015](https://arxiv.org/abs/1507.05346)): Two possible "computational
engines" for statistics, in particular relating to sampling:

* Mathematics: formulas, probability theory, large-sample approximations, central limit theorem

* Computers: simulations, resampling methods
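
As a rough illustration of the two engines, the sketch below estimates the standard error of a sample mean both ways. The population, sample size, and number of replicates are made-up values chosen only for this example; they are not taken from the workshop materials.

```r
# Hypothetical skewed "population" (values invented for illustration)
set.seed(2017)
population <- rexp(10000, rate = 1)
n <- 50        # size of each sample
reps <- 1000   # number of simulated samples

# Engine 1 (mathematics): the formula sigma / sqrt(n)
formula_se <- sd(population) / sqrt(n)

# Engine 2 (computers): draw many samples and see how their means vary
sample_means <- replicate(reps, mean(sample(population, size = n)))
simulated_se <- sd(sample_means)

# The two estimates should land close to each other
c(formula = formula_se, simulated = simulated_se)
```

A histogram of `sample_means` (for example via `hist(sample_means)`) gives students a picture of the sampling distribution without any appeal to the central limit theorem.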

---

## 2) Computers, Not Math!

We present students with a choice for our "engine":

Either we use this... | Or we use this...
:-------------------------:|:-------------------------:
<img src="figure/formulas.png" alt="Drawing" style="width: 250px;"/> | <img src="figure/coding.jpg" alt="Drawing" style="width: 250px;"/>

* Almost all are thrilled to do the latter

* Leave "bread crumbs" for more advanced math/stats courses

---

## 2) Computers, Not Math!

What does this buy us?

* Emphasizes that stats is not math; rather, stats uses math.
* Simulations are more tactile
* Reducing probability and the march to the CLT frees up space in the syllabus.

---

## 3) Algorithms, Computation, & Coding

* Both "Data First!" and "Computers, Not Math!" necessitate algorithmic thinking, computational logic, and some coding/programming.
* Battle is more psychological than anything:
    + "This is not a class on programming!"
    + "Computers are stupid!"
    + "Learning to code is like learning a foreign language!"
    + "Early on don't code from scratch! Take something else that's similar and tweak it!"
    + Learning how to Google effectively

---

## 3) Algorithms, Computation, & Coding

Why should we do this?

* Data science and machine learning.
* Where statistics is heading. Gelman [blog post](http://andrewgelman.com/2017/05/14/computer-programming-prerequisite-learning-statistics/).
* If we don't, we are doing a disservice to students by shielding them from these computational ideas.
* Bigger picture: Coding is becoming a basic skill like reading and writing.

---

## 4) Complete Reproducibility

* Students learn best when they can take apart a toy (analysis) and then rebuild it (synthesis).
* Crisis in Reproducibility
* Ultimately the best textbook is one you've written yourself.
    + Everyone has different contexts, backgrounds, needs
    + Hard to find one-size-fits-all solutions
* A new paradigm in textbooks? [Versions, not editions?](https://twitter.com/rudeboybert/status/820032345759592448)

---

## Let's Dive In!

---

## Baby's First Bookdown

* ModernDive Light: Just Data Science Chapters of Bookdown
* Download the ZIP file [`master.zip`](https://github.com/ismayc/moderndiver-lite/archive/master.zip) & extract its contents to a folder on your computer
* Double click `moderndiver-lite.Rproj` to open in RStudio
* Build -> Build Book
    - `install.packages('knitr', repos = c('http://rforge.net', 'http://cran.rstudio.org'), type = 'source')`

---

# Getting Started

## DataCamp

DataCamp offers an interactive, browser-based tool for learning R/Python. Their
two flagship R courses are both free:

* [Intro to R](https://www.datacamp.com/courses/free-introduction-to-r)
* [Intermediate R](https://www.datacamp.com/courses/intermediate-r-practice)

---

## DataCamp

Outsource many essential but not-fun-to-teach topics, such as:

* Idea of command-line vs point-and-click
* Syntax: Variable names typed exactly, parentheses matching
* Algorithmic Thinking: Linearity of code, object assignment
* Computational Logic: boolean algebra, conditional statements, functions

---

## DataCamp Pros

* Can assign "Intro to R" first day of class as "Homework 0"
* Outsourcing allows you to free up class time
* Students get immediate feedback on whether or not their code works
    + Often, the DataCamp error messages are much more useful than the ones R gives

---

## DataCamp Pros
    
* With their [free academic license](https://www.datacamp.com/groups/education), you can
    + Form class "Groups" and assign/track progress of DataCamp courses
    + Have free access to ALL [their courses](https://www.datacamp.com/courses), including `ggplot2`, `dplyr`, `rmarkdown`, and RStudio IDE.
    + Create your own free DataCamp course covering content you want your students to learn using R

---

## DataCamp Cons

* Some students will still have trouble; however, you can identify them.
* The topics in these two free courses may not align with your syllabus. However, you can assign at the chapter level instead of the course level.

---

## DataCamp Conclusion

* Not a good tool for "quick retention," but a good one for introducing R concepts and for subsequent repetition.
  + Students need to practice "speaking the language" just like with a foreign language.
* [Feedback](https://docs.google.com/spreadsheets/d/1qUwt-v-xAQJ-1OTzMI2Q27McXG_3yXO5Qe9p-jz6BA8/edit) from students was positive.
* Battle is more psychological than anything. DataCamp reinforces to students that
    + "Computers are stupid!"
    + "Learning to code is like learning a foreign language!"

---

## Chester's First Bookdown Project

[Getting used to R, RStudio, and R Markdown](https://ismayc.github.io/rbasics-book/)

- Designed to provide students with GIFs to follow along with and a description of all the components of RStudio and R Markdown

---

## Short break?

---

## Important R ideas for students to know ASAP

Vector/variable
  - Type of vector (`int`, `num`, `chr`, `logical`, `date`)

Data frame
  - Vectors of (potentially) different types
  - Each vector has the same number of rows
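
A minimal sketch of these two ideas in base R follows. The variable names and values are invented for illustration only.

```r
# Vectors: each one holds values of a single type
id       <- c(1L, 2L, 3L)                                   # integer
height   <- c(1.62, 1.75, 1.80)                             # numeric (double)
name     <- c("Ana", "Ben", "Cy")                           # character
passed   <- c(TRUE, FALSE, TRUE)                            # logical
enrolled <- as.Date(c("2017-01-30", "2017-01-30", "2017-02-01"))  # Date

class(height)   # ask R for the type of a vector

# Data frame: vectors of (potentially) different types, side by side
students <- data.frame(id, height, name, passed, enrolled,
                       stringsAsFactors = FALSE)
str(students)   # one line per column, with its type
nrow(students)

# Vectors of mismatched lengths cannot be combined into a data frame
bad <- try(data.frame(id = 1:3, name = c("Ana", "Ben")), silent = TRUE)
inherits(bad, "try-error")
```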
  
---

class: center, middle 
 
# Welcome to the [tidyverse](https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/)!
 
The `tidyverse` is a collection of R packages that share common philosophies and are designed to work together. 
 
<a href="http://tidyverse.tidyverse.org/logo.png"><img src="figure/tidyverse.png" style="width: 200px;"/></a>

---

# Chapter 3: Tidy Data?

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

The third point means we don't mix apples and oranges.

---

## What is Tidy Data?

1. Each observation forms a row. In other words, each row corresponds to a single instance of an observational unit
1. Each variable forms a column:
 + Some variables may be used to identify the observational units. 
 + For organizational purposes, it's generally better to put these in the left-hand columns
1. Each type of observational unit forms a table.

---

## Differentiating between neat data and tidy data

- Colloquially, they mean the same thing
- But in our context, one is a subset of the other.

Neat data is 
 - easy to look at, 
 - organized nicely, and 
 - in table form.

Tidy data is neat but also abides by a set of three rules.

---

## Is this tidy?

```
# A tibble: 12 × 4
    year title                        clean_test budget_2013
   <int> <chr>                        <chr>            <int>
 1  1995 Apollo 13                    ok            99370665
 2  2005 Brokeback Mountain           notalk        16583160
 3  2010 Diary of a Wimpy Kid         ok            16023478
 4  1984 Dune                         dubious      100864980
 5  1984 Ghostbusters                 notalk        67243320
 6  2003 How to Lose a Guy in 10 Days men           63304348
 7  2011 Iris                         ok             5696299
 8  2004 Sideways                     ok            20964279
 9  2000 Songcatcher                  ok             2435235
10  2004 Team America: World Police   men           24663858
11  2010 Tron Legacy                  notalk       213646368
12  2011 War Horse                    notalk        72498355
```

---

## How about this? Is this tidy?

```
# A tibble: 12 × 13
   country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
   <chr>       <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
 1 Albania        -9     -9     -9     -9     -9     -9     -9     -9
 2 Argentina      -9     -1     -1     -9     -9     -9     -8      8
 3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7
 4 Australia      10     10     10     10     10     10     10     10
 5 Austria        10     10     10     10     10     10     10     10
 6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7
 7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7
 8 Belgium        10     10     10     10     10     10     10     10
 9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10
10 Bolivia        -4     -3     -3     -4     -7     -7      8      9
11 Brazil          5      5      5     -9     -9     -4     -3      7
12 Bulgaria       -7     -7     -7     -7     -7     -7     -7     -7
# ... with 4 more variables: `1992` <int>, `1997` <int>, `2002` <int>,
#   `2007` <int>
```
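
In this table the year columns hold values of a variable (the year) rather than being variables themselves, so the first rule of tidy data is not met. One possible way to reshape it is sketched below using `pivot_longer()` from `tidyr` (available from version 1.0.0; `gather()` was the equivalent function at the time of the workshop). Only three rows and two year columns from the table are used, and the new column names `year` and `score` are labels chosen for this sketch.

```r
library(tidyr)

# Three rows and two year columns copied from the table above
wide <- data.frame(
  country = c("Albania", "Australia", "Bolivia"),
  `1952` = c(-9L, 10L, -4L),
  `1957` = c(-9L, 10L, -3L),
  check.names = FALSE,       # keep the year column names as-is
  stringsAsFactors = FALSE
)

# Stack the year columns so that each row is one country-year observation.
# "year" and "score" are column names invented for this example.
tidy <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "score"
)
tidy
```

After reshaping, each row is a single country-year pair and each of the three variables has its own column. Note that `year` comes out as a character column here.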

[Why is tidy data important?](#whytidy) slide

---

## Beginning steps

Frequently the first thing to do when given a dataset is to

- check that the data is tidy,
- identify the observational unit,
- specify the variables, and
- give the types of variables you are presented with.

This will help with

- choosing the appropriate plot, 
- summarizing the data, and 
- understanding which inferences can be applied.
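
A minimal sketch of these first checks in base R, using a few rows copied from the movie table shown earlier (the object name `movies` is chosen for this example):

```r
# A few rows copied from the movie table shown earlier
movies <- data.frame(
  year = c(1995L, 2005L, 1984L),
  title = c("Apollo 13", "Brokeback Mountain", "Dune"),
  clean_test = c("ok", "notalk", "dubious"),
  budget_2013 = c(99370665L, 16583160L, 100864980L),
  stringsAsFactors = FALSE
)

dim(movies)            # rows (one per movie) and columns (variables)
names(movies)          # the variables
sapply(movies, class)  # the type of each variable
str(movies)            # compact overview of all of the above
```

Here the observational unit is a movie. In a `tidyverse` workflow, `glimpse()` from `dplyr` gives a similar overview to `str()`.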

---

# Chapter 4: Data Viz

Inspired by [Hans Rosling](https://www.youtube.com/watch?v=jbkSRLYSojo)

---

- What are the variables here?
- What is the observational unit?
- How are the variables mapped to aesthetics?

---

## Grammar of Graphics

Wilkinson (2005) laid out the proposed "Grammar of Graphics"

---

## Grammar of Graphics in R

Wickham implemented the grammar in R in the `ggplot2` package

---

## What is a statistical graphic?

## A `mapping` of `data` variables

## to `aes()`thetic attributes

## of `geom_`etric objects.
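
The three pieces of that definition line up with the arguments of a `ggplot2` call. The sketch below uses a toy data frame invented for this example; the names `toy`, `hours`, `score`, and `section` are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 60, 71, 75, 88),
  section = c("A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

p <- ggplot(
  data = toy,                           # the data
  mapping = aes(x = hours, y = score,   # variables mapped to
                color = section)        # aesthetic attributes
) +
  geom_point()                          # of a geometric object

p   # printing the object draws the plot
```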

---

## Back to basics

---

### Consider the following data in tidy format:

```
# A tibble: 4 × 4
      A     B     C D
  <dbl> <dbl> <dbl> <chr>
1  1980     1     3 low
2  1990     2     2 low
3  2000     3     1 high
4  2010     4     2 high
```

- Sketch the graphics below on paper, where the `x`-axis is variable `A` and the `y`-axis is variable `B`

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `D`
1. A scatter plot where the `size` of the points corresponds to `C`
1. A line graph
1. A line graph where the `color` of the line corresponds to `D` with points added that are all green of size 4.
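
After sketching on paper, the five graphics can be checked against `ggplot2`. The code below is one possible set of answers, not necessarily the one used in the workshop. The object names `simple_ex` and `p_base` are chosen for this sketch, and the data are rebuilt as a plain data frame rather than a tibble.

```r
library(ggplot2)

# The four-row data set shown above, rebuilt as a plain data frame
simple_ex <- data.frame(
  A = c(1980, 1990, 2000, 2010),
  B = c(1, 2, 3, 4),
  C = c(3, 2, 1, 2),
  D = c("low", "low", "high", "high"),
  stringsAsFactors = FALSE
)

# Shared data and x/y mapping for all five plots
p_base <- ggplot(simple_ex, aes(x = A, y = B))

p1 <- p_base + geom_point()                  # 1. scatter plot
p2 <- p_base + geom_point(aes(color = D))    # 2. color mapped to D
p3 <- p_base + geom_point(aes(size = C))     # 3. size mapped to C
p4 <- p_base + geom_line()                   # 4. line graph
p5 <- p_base +                               # 5. line colored by D, plus
  geom_line(aes(color = D)) +                #    points set to a fixed
  geom_point(color = "green", size = 4)      #    color and size

p5   # print any of p1 to p5 to draw it
```

In the fifth plot, `color = D` sits inside `aes()` because it maps a variable to an aesthetic, while `color = "green"` and `size = 4` sit outside `aes()` because they set fixed values for every point.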

---