Carefully read over the directions and be sure that you are using the appropriate data set as specified in each problem statement. Choosing an incorrect data set will cause you problems and hurt your grade. As a general note, be careful and check over all your problems a couple of times to make sure you have completed all tasks!

The point totals for each problem are given after the problem number. I encourage you to carefully look these over as you work through the exam.

Note: If you’d like to enter a table in a “nice” format in markdown you can do so with something like:

| name    | role      | age |
|:--------|:----------|----:|
| Chester | R guy     | 33  |
| Doug    | Carpenter | 60  |
| Susan   | Astronaut | 42  |

Student Learning Objectives

This exam is designed to test your understanding of the following Student Learning Objectives from the syllabus:

  • Create tidy data sets by carefully identifying the observational units and the types of variables that make up the data set.
  • Make and interpret the Five Named Graphs: histograms, boxplots, barplots, scatter-plots, and line-graphs.
  • Use multivariate thinking by producing and discussing appropriate plots showing the relationships between three (or more) variables.
  • Identify the appropriate plot and type of analysis needed to answer a given social research question.

Here are the rules for the exam:

  1. You are to take this exam by yourself. You shouldn’t be asking any questions of your classmates or discussing this exam in any way with your classmates or anyone else that may be able to help you with this exam. The only exception is Chester, but he will only answer clarifying questions. (Remember that “I don’t know how to start.” is not a good request for help.)

  2. You can use the course textbooks, your labs (with my corrections), your homework and my answers, and any links directly from the class webpage at http://ismayc.github.io/teaching/soc301-f2016/. You cannot directly ask anyone on the Internet for help, but you can Google if you’d like assistance with something ggplot2 related. I urge you to stick to only using the types of code we have discussed in class though. If you aren’t sure about whether some R code is OK to use, you should be emailing me and if you really aren’t sure, you probably shouldn’t be using it.

  3. Your plots should be done using the ggplot2 package and I recommend you load this package in the first R chunk. I will not accept plots (you will receive no credit) made using base R or other plotting systems in R or non-R tools.

  4. Your exam should be reproducible. In other words, you should not be copying R output into the R Markdown file and all of the R code needed should be stored in chunks with the desired output immediately following the chunks. You’ll potentially lose all points for the Problem if this policy is not followed.

  5. Typos will lead to a deduction of points, likely a significant number. You all have the ability to spellcheck and it shows that you have pride in your work and that you value my time when you do so. I also encourage you to have a friend / tutor read over your exam to make sure your sentences make sense. (DON’T ASK THEM FOR STATISTICAL/R HELP THOUGH!)


Agreement to exam rules

Please type your first and last name below next to NAME: and today’s date next to DATE: to acknowledge that you will follow these rules. Failure to follow these rules may result in failing the course due to academic dishonesty.

NAME:

DATE:


This exam will focus on using data made famous by Hans Rosling in his TED talk from 2006 entitled “The Best Stats You’ve Ever Seen.” While you’re not required to watch the video for this exam, it’s a really interesting talk and Hans has a unique way of discussing statistical ideas. The video is available here. The data discussed there has been updated and is now stored on the Gapminder project at http://www.gapminder.org. (You don’t need to go to the website, but it is a nice reference.)

Problem 1 (10 points)

One of the data sets presented on Gapminder gives a “democracy score” to each country for each year. I’ve extracted years 1952 to 2007 in five-year increments in the CSV below. The “democracy score” is lowest at -10 and highest at 10. The more negative a value, the more non-democratic (autocratic) the country is rated to be. The more positive a value, the more democratic the country is rated to be. Run the R chunk below, which loads this data set into a data frame in R.

library(readr)
dem_score <- read_csv("dem_score.csv")

  1. Is the dem_score data frame a tidy data frame?

    Answer:

  2. If so, clearly explain how each of the properties of a tidy data set is met. If not, explain how many variables a corresponding tidy data set would have and lay out what five rows of this tidy data set would look like using the table format from the note at the beginning of the exam.

    Answer:


In the remaining problems on this exam, you’ll be either directly working with or using a subset of the data set read in the following chunk:

gap <- read_csv("gapminder.csv")

Problem 2 (10 points)

Use the str function in the blank R chunk below to describe the different types of variables in this gap dataset.

  1. Which are categorical?

    Answer:

  2. Which are numeric?

    Answer:

  3. Why is one of the variables labeled as int?

    Answer:

  4. Run the R chunk below:

    gap$dem_rank <- factor(gap$dem_rank, 
                       levels = c("Strongly Autocratic", 
                                  "Mildly Autocratic",
                                  "Middle of the Road", 
                                  "Mildly Democratic", 
                                  "Strongly Democratic")
                       )

    1. What is the purpose of this R chunk?

      Answer:

    2. How does the factor function help with plots? (You may be able to better answer this question after working through the entire exam.)

      Answer:


Problem 3 (10 points)

You may have already guessed by running View(gap) that the variables correspond to measurements on a country in a year. Specifically,

  • lifeExp corresponds to (average) life expectancy,
  • pop corresponds to an estimate of population,
  • gdpPercap corresponds to gross domestic product per capita, and
  • dem_rank is a categorical version of the values in the dem_score data frame from earlier.

  1. What is the observational unit for this gap data frame? Remember to be as specific as possible.

    Answer:

  2. The following code attempts to extract all values for pop in gap except for entries with indices 1, 10, 14, 15, and 1000 and assigns that to a new vector called few_pop. Explain what is wrong with the code (there are at least 6 things) and correct the code to give the desired result.

    few_pop -> gap%pop[[c(-1, 10, 14-15), 1000]]

    Answer:


Problem 4 (10 points)

As you look over the gap data frame, you’ll notice that each country has 12 entries for the 12 different years with information. For the purposes of this problem, we will focus on only 2007. (You’ll see more details on doing procedures like this in Chapter 5 and on your next lab.) The chunk below creates a new data frame focused only on a year value of 2007 in the gap data frame.

gap2007 <- dplyr::filter(gap, year == 2007)
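
If the filtering step is unfamiliar, here is a minimal sketch of the same idea on made-up data. The toy data frame and its columns are hypothetical and are not part of the exam data; only the dplyr::filter call mirrors the chunk above.

```r
# Hypothetical toy data, NOT the exam's gap data frame
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  value   = c(10, 12, 20, 23)
)

# Keep only the rows where year is equal to 2007
# (note the double == for comparison, not a single =)
toy2007 <- dplyr::filter(toy, year == 2007)

# Print the smaller data frame: one row per country remains
toy2007
```

The original toy data frame is left unchanged; the filtered rows are stored under a new name, just as gap2007 is stored separately from gap.
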

  1. Produce an appropriate plot looking at the frequency of countries by subRegion in the gap2007 data frame. Also, fill based on region. Your horizontal axis labels will likely be jumbled up so you need to add (+) theme(axis.text.x = element_text(angle = 60, hjust = 1)) as a layer to the plot.

  2. Which subRegion has the most countries?

    Answer:

  3. Which subRegion appears in two different regions?

    Answer:

  4. Which region is made up entirely of only one subRegion?

    Answer:


Problem 5 (20 points)

  1. Using the gap2007 data frame, produce an appropriate plot comparing the relationship of gdpPercap on the horizontal axis and lifeExp on the vertical axis.

  2. Would you describe the overall relationship as strongly linear? Explain in two or three sentences what the pattern shows.

    Answer:

  3. Looking over this plot and focusing on gdpPercap smaller than 5000, how many countries have an average life expectancy over 75?

    Answer:

  4. Does the country with the highest gdpPercap also have the highest lifeExp? If not, around what gdpPercap corresponds to the highest lifeExp?

    Answer:

  5. EXTENDING KNOWLEDGE Produce the same plot asked for in part 1, but now change the transparency based on values of dem_rank and change the size to be based on population (pop).

  6. Describe in two or three sentences how this plot provides more insight into your responses to parts 3 and 4.

    Answer:


Problem 6 (20 points)

When you use View(gap2007), you’ll notice that many of the entries for dem_rank have a value of NA. We can remove all rows that have any of these NA values and create a new smaller data set using

gap2007small <- na.omit(gap2007) 
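
As a rough illustration of what na.omit does to a data frame, here is a sketch on made-up data. The toy data frame and its column names are hypothetical and are not the exam data.

```r
# Hypothetical toy data, NOT the exam's gap2007 data frame
toy <- data.frame(
  country = c("A", "B", "C"),
  score   = c(5, NA, 8),
  rank    = c("High", "Low", NA)
)

# na.omit() drops every row that has an NA in ANY column
toy_small <- na.omit(toy)

nrow(toy)        # 3 rows before
nrow(toy_small)  # 1 row after: only country "A" has no missing values
```

Notice that a row is removed even if only one of its columns is missing, so the smaller data set can lose more rows than you might expect.
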

  1. What do NA values mean in R?

    Answer:

  2. What are some reasons why NA values may exist in the data? (This is more of a practical question than a statistical one.)

    Answer:

  3. Produce an appropriate plot looking at the distribution of gdpPercap in gap2007small. Change the color of the inside of the plotted values and the border of the plotted values as you have done in labs and in class. (The words aren’t intended to be vague here, but I don’t want to tell you the answer of which plot to make…)

  4. Produce an appropriate plot looking at the distribution of gdpPercap over dem_rank in gap2007small. Make sure to tweak the default color settings to your liking.

  5. EXTENDING KNOWLEDGE Produce a faceted boxplot (you read that right) looking at the distribution of gdpPercap over values of region in combination with dem_rank in gap2007small. Note that you’ll need to add the tilting of the axis labels via theme(axis.text.x = element_text(angle = 60, hjust = 1)) again.

  6. EXTENDING KNOWLEDGE Focusing on the Strongly Democratic small multiple plot produced in part 5, describe how the 25th, 50th, and 75th percentiles vary across the levels of region.

    Answer:


Problem 7 (10 points)

Another feature we will see in Chapter 5 is the ability to choose specific values from a list and filter the data set according to this. Below we will pick 6 countries from different regions in the world and then make an appropriate plot to see how each country’s life expectancy has changed over time.

country_list <- c("Argentina", "United States", "Liberia", "Pakistan", "Finland", "New Zealand")
six_countries <- dplyr::filter(gap, country %in% country_list)
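
To see what %in% is doing inside filter, here is a small sketch on made-up data. The toy data frame, the keep vector, and the country labels are hypothetical and are not the exam data.

```r
# Hypothetical toy data, NOT the exam's gap data frame
toy <- data.frame(
  country = c("A", "B", "C", "A", "B", "C"),
  year    = c(2002, 2002, 2002, 2007, 2007, 2007)
)

# The values we want to keep
keep <- c("A", "C")

# %in% checks each entry of country against the keep vector
# and returns TRUE or FALSE for every row
toy$country %in% keep

# filter() keeps only the rows where that check is TRUE
toy_subset <- dplyr::filter(toy, country %in% keep)
toy_subset
```

Every row for countries "A" and "C" is kept across all years, which is why six_countries still has one row per country per year.
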

Now run View(six_countries) in the R console to get a sense for what this new data set looks like.

  1. EXTENDING KNOWLEDGE Produce the appropriate plot showing how the life expectancy for each country in six_countries has changed over time. You should color based on subRegion.

  2. What is the name of the country in six_countries that had the lowest life expectancy in 1970?

    Answer:

  3. Which three countries overlapped on the plot?

    Answer:

  4. What is the only country to show a decline in life expectancy? In what years did this drop occur?

    Answer:

  5. Which country showed the greatest increase in life expectancy from 1952 to 2007?

    Answer:


Problem 8 (8 points)

Identify AT LEAST two variables that have not been analyzed together so far on this exam in the gap data set. Produce a plot looking at the relationship between the variables and discuss in three or four sentences the major findings from your plot.

Answer:


Closing notes

PRESS THE SPELLCHECK BUTTON!

Make sure to Knit HTML your Rmd when you are done. You may also find errors by knitting so you are encouraged to Knit HTML frequently as you work on this exam.


Reflect (2 points for completion)

Just as with the labs, I want you to take a bit of time to reflect on what you learned via this take home exam. I expect some of the problems to be challenging but I believe you are all capable of figuring them out if you push yourselves to learn more and review the material we have covered so far. Resist the urge to go to the book immediately for an answer or go to Google to ask questions. You should try to think about a problem for 15 minutes or so to see if an answer comes to you first. You may also find value in skipping problems that you don’t immediately know how to do as other problems may guide you to the correct reasoning.

Answer each of these reflections in two or more complete sentences for full credit.

  • What surprised you the most about this exam?

    Response:

  • Looking back over the Take Home Exam - Study Guide, do you see how many of the points there were addressed on this exam? If not, which Problems here do you believe were not addressed?

    Response:

  • You’ve come a long way already in learning how to use R. How do you feel today compared to how you felt in the first week or two of class with regards to this?

    Response:

  • What have you learned from a social perspective by analyzing the gap data? What new questions do you have about the data?

    Response:

