Overview

Due before class on October 16th.

Fork the hw02 repository

Go here to fork the repo for homework 02.

Part 1: Exploring clean data (7 points)

FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012 to 2014.

Obtain the data

I have included this dataset in the rcfss package on GitHub. To install the package, run devtools::install_github("uc-cfss/rcfss") in R. If you don’t already have the devtools package installed, this command will fail with an error; install devtools first using install.packages("devtools"), then install rcfss. Once the package is loaded with library(rcfss), the gun deaths dataset can be loaded using data("gun_deaths"). Use the help function in R (?gun_deaths) to get detailed information on the variables and how they are coded.

Explore the data

Very specific prompts

  1. Generate a data frame that summarizes the number of gun deaths per month.
    1. Print the data frame as a formatted kable() table.
    2. Generate a bar chart with human-readable labels on the x-axis. That is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.
  2. Generate a bar chart that shows the number of gun deaths associated with each intent (cause of death). The bars should be sorted from highest to lowest values.
  3. Generate a boxplot visualizing the age of gun death victims, by sex. Print the average age of female gun death victims.
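
The two labeling and ordering techniques these prompts rely on can be sketched on toy data. This is only an illustration under stated assumptions: the events data frame and its month and kind columns are hypothetical and are not part of gun_deaths. It uses R's built-in month.abb vector for month names and forcats::fct_infreq() for frequency ordering.

```r
library(tidyverse)

# hypothetical toy data: one row per event, with the month stored as a number
events <- tibble(
  month = c(1, 1, 2, 3, 3, 3),
  kind  = c("a", "b", "b", "c", "b", "a")
)

# count events per month, then map month numbers to abbreviated names;
# the factor levels keep the bars in calendar order, not alphabetical order
by_month <- events %>%
  count(month) %>%
  mutate(month_label = factor(month.abb[month], levels = month.abb))

month_plot <- ggplot(by_month, aes(x = month_label, y = n)) +
  geom_col() +
  labs(title = "Events per month (toy data)", x = "Month", y = "Number of events")

# fct_infreq() reorders categories by frequency, most common first
kind_plot <- ggplot(events, aes(x = fct_infreq(kind))) +
  geom_bar() +
  labs(title = "Events by kind (toy data)", x = "Kind", y = "Number of events")

# print either object (for example, month_plot) to draw the chart
```

Storing the labels as a factor with explicit levels is what controls the bar order; a plain character column would be sorted alphabetically.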

More open-ended questions

Answer the following questions. Generate appropriate figures/tables to support your conclusions.

  1. How many white males with at least a high school education were killed by guns in 2012?
  2. Which season of the year has the most gun deaths? Assume that
    • Winter = January-March
    • Spring = April-June
    • Summer = July-September
    • Fall = October-December
    • Hint: you need to convert a continuous variable into a categorical variable. Find a function that does that.
  3. Are white gun death victims more likely to die by suicide or by homicide? How does this compare to Black and Hispanic victims?
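
One candidate for the kind of function the hint describes is base R's cut(), which bins a numeric vector into labeled intervals. The sketch below uses hypothetical exam scores, not the gun deaths data, and the break points and labels are arbitrary choices for illustration.

```r
# hypothetical toy data: exam scores on a 0-100 scale
scores <- c(55, 72, 88, 91, 67)

# cut() bins a numeric vector into labeled intervals;
# with right = FALSE each interval includes its lower bound:
# [0, 60), [60, 80), [80, 101)
grades <- cut(
  scores,
  breaks = c(0, 60, 80, 101),
  labels = c("low", "middle", "high"),
  right  = FALSE
)

# tabulate how many scores fall in each bin
table(grades)
```

The result is a factor, so it can be passed directly to count() or mapped to an axis in ggplot().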

Formatting graphs

While you are practicing data analysis, your final graphs should be appropriate for sharing with outsiders. That means your graphs should have:

  • A title
  • Labels on the axes (see ?labs for details)

This is just a starting point. Consider adopting your own color scales, taking control of your legends (if any), playing around with themes, etc.
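
As a minimal sketch of these requirements, the example below adds a title and axis labels with labs() and swaps in a built-in theme. The toy data frame and its group and total columns are hypothetical.

```r
library(tidyverse)

# hypothetical toy data
toy <- tibble(
  group = c("A", "B", "C"),
  total = c(12, 30, 21)
)

labelled_plot <- ggplot(toy, aes(x = group, y = total)) +
  geom_col() +
  labs(
    title = "Totals by group (toy data)",  # graph title
    x = "Group",                           # x-axis label
    y = "Total"                            # y-axis label
  ) +
  theme_minimal()                          # one of several built-in themes

# print labelled_plot to draw the chart
```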

Formatting tables

When presenting tabular data (e.g. the output of dplyr::summarize()), make sure you format it correctly. Use the kable() function from the knitr package to format the table for the final document. For instance, this is a poorly presented table summarizing where gun deaths occurred:

library(tidyverse)
library(knitr)
library(rcfss)
# calculate total gun deaths by location
count(gun_deaths, place)
## # A tibble: 11 x 2
##    place                       n
##    <chr>                   <int>
##  1 Farm                      470
##  2 Home                    60486
##  3 Industrial/construction   248
##  4 Other specified         13751
##  5 Other unspecified        8867
##  6 Residential institution   203
##  7 School/instiution         671
##  8 Sports                    128
##  9 Street                  11151
## 10 Trade/service area       3439
## 11 <NA>                     1384

Instead, use kable() to format the table, add a caption, and label the columns:

count(gun_deaths, place) %>%
  kable(caption = "Gun deaths in the United States (2012-2014), by location",
        col.names = c("Location", "Number of deaths"))

Gun deaths in the United States (2012-2014), by location

| Location                | Number of deaths |
|:------------------------|-----------------:|
| Farm                    |              470 |
| Home                    |            60486 |
| Industrial/construction |              248 |
| Other specified         |            13751 |
| Other unspecified       |             8867 |
| Residential institution |              203 |
| School/instiution       |              671 |
| Sports                  |              128 |
| Street                  |            11151 |
| Trade/service area      |             3439 |
| NA                      |             1384 |

Run ?kable in the console to see additional options.

Note that when viewed on GitHub, table captions will not show up. Just a (missing) feature of Markdown on GitHub 😔

Part 2: Tidying messy data (4 points)

In the rcfss package, there is a data frame called dadmom.

## # A tibble: 3 x 5
##   famid named  incd namem  incm
##   <dbl> <chr> <dbl> <chr> <dbl>
## 1     1 Bill  30000 Bess  15000
## 2     2 Art   22000 Amy   18000
## 3     3 Paul  25000 Pat   50000

Tidy this data frame so that it adheres to the tidy data principles:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

NOTE: You can accomplish this task in a single piped operation using only tidyr functions. Code that does not use tidyr functions is acceptable, but will not merit a “check plus” on your evaluation.
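
For a sense of what a tidyr reshaping step looks like, here is a minimal sketch on a deliberately different, simpler table. The wide data frame and its columns are hypothetical; dadmom needs more than this single step, so treat it as an illustration of pivot_longer() rather than a solution.

```r
library(tidyverse)

# hypothetical untidy data: the year is stored in the column names
wide <- tibble(
  id          = c(1, 2),
  height_2019 = c(150, 160),
  height_2020 = c(155, 162)
)

# move the year out of the column names and into its own column,
# so that each row is one (id, year) observation
long <- wide %>%
  pivot_longer(
    cols         = starts_with("height_"),
    names_to     = "year",
    names_prefix = "height_",
    values_to    = "height"
  )

long
```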

Part 3: Joining data frames (2 points)

Recall the gapminder data frame we previously explored. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join the original gapminder data frame with a new data file containing the HIV prevalence rate in each country (see footnote 1).

The HIV prevalence rate is stored in the data folder as a CSV file. You need to import and merge the data with gapminder to answer these two questions:

  1. What is the relationship between HIV prevalence and life expectancy? Generate a scatterplot with a smoothing line to report your results.
  2. Which continents have the most observations with missing HIV data? Generate a bar chart, ordered in descending height (i.e. the continent with the most missing values on the left, the continent with the fewest missing values on the right).

For each question, you need to perform a specific type of join operation. Think about what type makes the most sense and explain why you chose it.
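
To compare how the join types behave, the sketch below applies three dplyr joins to two small tables. The countries and rates data frames are hypothetical stand-ins, not the gapminder or HIV data; the point is only how each join treats unmatched rows.

```r
library(tidyverse)

# hypothetical toy tables sharing a "country" key
countries <- tibble(country = c("A", "B", "C"), life_exp = c(70, 65, 80))
rates     <- tibble(country = c("A", "B", "D"), rate = c(0.1, 2.5, 0.4))

# left_join() keeps every row of the first table; C gets NA for rate
kept_all <- left_join(countries, rates, by = "country")

# inner_join() keeps only rows with a match in both tables (A and B)
matches_only <- inner_join(countries, rates, by = "country")

# anti_join() keeps rows of the first table with no match in the second (C)
unmatched <- anti_join(countries, rates, by = "country")
```

Whether unmatched rows should be kept, dropped, or isolated depends on the question being asked, which is why each question may call for a different join.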

Part 4: Exploring the General Social Survey (7 points)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. It is conducted biennially (every two years) through in-person interviews using a probability sampling approach. It is one of the most commonly studied datasets in the social science disciplines.

Using the graphical data analysis skills we have reviewed in class, you will conduct an exploratory analysis of the data to identify interesting questions and (potential) answers. Remember the types of questions we seek to answer using GDA:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?
  3. Are there outliers in the data?
  4. Do I have missingness? Are there patterns to it?
  5. How much variation/error exists in my statistical estimates? Is there a pattern to it?
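
Some of these questions can be given a quick numeric first pass before plotting. The sketch below counts missing values per column and summarizes the spread of one variable. The survey data frame and its columns are hypothetical and do not come from the GSS.

```r
library(tidyverse)

# hypothetical toy survey data with some missing values
survey <- tibble(
  age    = c(25, 40, NA, 61, 33),
  income = c(NA, 52000, 48000, NA, 39000),
  region = c("north", "south", "south", NA, "north")
)

# count missing values in every column
missing_counts <- survey %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# summarize the variation in one numeric variable, ignoring missing values
age_summary <- survey %>%
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    sd_age   = sd(age, na.rm = TRUE)
  )
```

Numeric summaries like these are a starting point; graphs such as histograms and boxplots usually reveal more about the shape of the variation.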

What not to do

Build a statistical model

No complex statistical methods should be employed. Focus instead on primarily graphical analysis, though you can also use basic statistical tests you may have learned in other classes (e.g. tests for normality, difference of means).

Adjust for survey weights

Do not worry about using survey weights in your exploratory analysis. Just treat every observation equally.

What you should do

The final submission should include two components.

Lab notebook (4 points)

This is a record of all your exploratory analysis. It should be extensive (minimum 30-40 graphs), and mostly code and graphics.

  • Minimally annotate your code and output as necessary to keep track of what you’ve done and highlight important insights gained through your exploration
  • It should be somewhat stream-of-consciousness (that is, a stored record of your exploration as you explore the data), though certainly feel free to maintain a structure or go back and reformat different sections
  • Don’t bother cleaning up each graph to have meaningful labels

Exploration write-up (3 points)

In a short paper (around 750 words), summarize your insights and what you’ve learned about the data. This could include one or two important research questions you think you could answer using the data, as well as some initial hypotheses supported by your exploratory analysis. Or perhaps you’ve identified unusual variation in a single variable, or extreme outliers or systematic missingness in the data that should be accounted for in future analysis. This component will look different for each student. That’s fine. What I want to see is genuine effort and some thought put into what you’ve learned from this GDA.

  • This component should include mostly written analysis and a handful of graphs to support your questions and answers
  • Clean up these graphs so they are publication-ready. That means giving each graph a meaningful title, axis labels, legends, etc.

Accessing the data

You can access this data file in the poliscidata package:

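# install the package once; it does not need to be reinstalled each session
install.packages("poliscidata")
# load only the gss data frame from the package
data(gss, package = "poliscidata")

# convert to tibble
library(tidyverse)
gss <- as_tibble(gss)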

Dataset documentation

In the documentation folder, there are three files that are potentially relevant to your analysis.

  • codebook.txt - a codebook of the dataset automatically generated by Stata
  • GSS_Codebook_index.pdf - a list of all variables available from the GSS, with their variable names in the data file and a brief description of the variable
  • GSS_Codebook_mainbody.pdf - a detailed description of all variables available from the GSS, with full question wording and potential responses

You can also find more information on the survey and specific variables at the GSS website.

Submit the assignment

Your assignment should be submitted as a set of R Markdown documents. Don’t know what an R Markdown document is? Read this! Or this! I have included starter files for you to modify to complete the assignment, so you are not beginning completely from scratch.

Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

  1. More specifically, the estimated number of people living with HIV per 100 people aged 15-49.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.