Overview

Due before class on October 16th.

Fork the `hw02` repository

Go here to fork the repo for homework 02.

Part 1: Exploring clean data (7 points)

FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012-2014.

Obtain the data

I have included this dataset in the rcfss library on GitHub. To install the package, use the command devtools::install_github("uc-cfss/rcfss") in R. If you don’t already have the devtools library installed, you will get an error. Go back and install this first using install.packages(), then install rcfss. The gun deaths dataset can be loaded using data("gun_deaths"). Use the help function in R (?gun_deaths) to get detailed information on the variables and coding information.

Explore the data

Very specific prompts

Generate a data frame that summarizes the number of gun deaths per month.
1. Print the data frame as a formatted kable() table.
2. Generate a bar chart with human-readable labels on the x-axis. That is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.
Generate a bar chart that identifies the number of gun deaths associated with each type of intent (cause of death). The bars should be sorted from highest to lowest values.
Generate a boxplot visualizing the age of gun death victims, by sex. Print the average age of female gun death victims.
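The labeling and sorting ideas in these prompts can be sketched on a small toy data frame. This is only an illustration under stated assumptions: `events`, `month`, and `kind` are hypothetical names and values, not the structure of `gun_deaths`, and other approaches (for example the forcats or lubridate packages) would work just as well.

```r
library(dplyr)
library(ggplot2)

# toy data (hypothetical, not the gun_deaths data): one row per event
events <- tibble(
  month = c(1, 1, 2, 3, 3, 3),
  kind  = c("a", "b", "b", "b", "c", "b")
)

# count rows per month, then map the month number to an abbreviated name;
# setting levels = month.abb keeps the bars in calendar order
by_month <- events %>%
  count(month) %>%
  mutate(month_label = factor(month.abb[month], levels = month.abb))

month_plot <- ggplot(by_month, aes(x = month_label, y = n)) +
  geom_col()

# count rows per category, then order the factor levels by descending count
# so the tallest bar is drawn first
by_kind <- events %>%
  count(kind) %>%
  mutate(kind = reorder(kind, -n))

kind_plot <- ggplot(by_kind, aes(x = kind, y = n)) +
  geom_col()
```

Printing `month_plot` or `kind_plot` draws the chart. The key design choice is to encode the display order in the factor levels rather than in the order of the rows.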

Formatting graphs

While you are practicing data analysis, your final graphs should be appropriate for sharing with outsiders. That means your graphs should have:

A title
Labels on the axes (see ?labs for details)

This is just a starting point. Consider adopting your own color scales, taking control of your legends (if any), playing around with themes, etc.
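As a minimal sketch of these requirements, the example below adds a title and axis labels with `labs()` and switches the theme. The data frame `toy` and its values are made up purely so there is something to plot.

```r
library(ggplot2)

# toy data (hypothetical values, used only to have something to plot)
toy <- data.frame(
  age = c(21, 34, 45, 52, 63),
  sex = c("F", "M", "M", "F", "M")
)

p <- ggplot(toy, aes(x = sex, y = age)) +
  geom_boxplot() +
  labs(
    title = "Age by sex (toy data)",  # plot title
    x = "Sex",                        # x-axis label
    y = "Age (years)"                 # y-axis label
  ) +
  theme_minimal()                     # one of several built-in themes
```

Printing `p` renders the graph.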

Formatting tables

When presenting tabular data (e.g. the output of dplyr::summarize()), make sure you format it correctly. Use the kable() function from the knitr package to format the table for the final document. For instance, this is a poorly presented table summarizing where gun deaths occurred:

library(tidyverse)
library(knitr)
library(rcfss)

# calculate total gun deaths by location
count(gun_deaths, place)

## # A tibble: 11 x 2
##    place                       n
##    <chr>                   <int>
##  1 Farm                      470
##  2 Home                    60486
##  3 Industrial/construction   248
##  4 Other specified         13751
##  5 Other unspecified        8867
##  6 Residential institution   203
##  7 School/instiution         671
##  8 Sports                    128
##  9 Street                  11151
## 10 Trade/service area       3439
## 11 <NA>                     1384

Instead, use kable() to format the table, add a caption, and label the columns:

count(gun_deaths, place) %>%
  kable(caption = "Gun deaths in the United States (2012-2014), by location",
        col.names = c("Location", "Number of deaths"))

Gun deaths in the United States (2012-2014), by location
Location	Number of deaths
Farm	470
Home	60486
Industrial/construction	248
Other specified	13751
Other unspecified	8867
Residential institution	203
School/instiution	671
Sports	128
Street	11151
Trade/service area	3439
NA	1384

Run ?kable in the console to see additional options.
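As one example of those options, the sketch below uses the `align` and `format.args` arguments of kable() to right-align the counts and add thousands separators. The small `places` data frame is an assumption for illustration; it copies three of the location counts from the table above rather than loading `gun_deaths`.

```r
library(knitr)

# three of the location counts shown in the table above
places <- data.frame(
  place = c("Home", "Street", "Farm"),
  n = c(60486, 11151, 470)
)

tbl <- kable(
  places,
  format = "pipe",                         # plain Markdown pipe table
  col.names = c("Location", "Number of deaths"),
  align = c("l", "r"),                     # left-align text, right-align counts
  format.args = list(big.mark = ",")       # passed on to format(): 60486 -> 60,486
)
tbl
```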

Note that when viewed on GitHub, table captions will not show up. Just a (missing) feature of Markdown on GitHub 😔

Part 2: Tidying messy data (4 points)

In the rcfss package, there is a data frame called dadmom.

## # A tibble: 3 x 5
##   famid named  incd namem  incm
##   <dbl> <chr> <dbl> <chr> <dbl>
## 1     1 Bill  30000 Bess  15000
## 2     2 Art   22000 Amy   18000
## 3     3 Paul  25000 Pat   50000

Tidy this data frame so that it adheres to the tidy data principles:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

NOTE: You can accomplish this task in a single piped operation using only tidyr functions. Code which does not use tidyr functions is acceptable, but will not merit a “check plus” on your evaluation.
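To illustrate what a tidyr reshaping step looks like in general, here is a minimal sketch on an unrelated toy table. It is deliberately not the dadmom data, and the names `wide`, `year`, and `cases` are hypothetical. The column headers `1999` and `2000` are really values of a variable, so `pivot_longer()` moves them into a column of their own.

```r
library(tidyr)
library(tibble)

# toy wide table: the year is stored in the column names
wide <- tibble(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25)
)

# one row per country-year pair, with the year as its own column
long <- wide %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
```

The result has one row per observation (a country in a year) and one column per variable.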

Part 3: Joining data frames (2 points)

Recall the gapminder data frame we previously explored. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join the original gapminder data frame with a new data file containing the HIV prevalence rate in each country.¹

The HIV prevalence rate is stored in the data folder as a CSV file. You need to import and merge the data with gapminder to answer these two questions:

What is the relationship between HIV prevalence and life expectancy? Generate a scatterplot with a smoothing line to report your results.
Which continents have the most observations with missing HIV data? Generate a bar chart, ordered in descending height (i.e. the continent with the most missing values on the left, the continent with the fewest missing values on the right).

For each question, you need to perform a specific type of join operation. Think about what type makes the most sense and explain why you chose it.
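The difference between join types is easiest to see on a tiny example. The sketch below uses two hypothetical toy tables (`countries` and `rates`, with made-up values), not the gapminder or HIV data, and simply shows which rows each join keeps.

```r
library(dplyr)

# toy tables: "B" has no rate, and "D" has a rate but no continent
countries <- tibble(
  country = c("A", "B", "C"),
  continent = c("X", "X", "Y")
)
rates <- tibble(
  country = c("A", "C", "D"),
  rate = c(0.1, 0.5, 0.2)
)

# inner join: keeps only countries present in both tables
inner <- inner_join(countries, rates, by = "country")

# left join: keeps every row of `countries`, filling unmatched rates with NA
left <- left_join(countries, rates, by = "country")

# number of rows with no matching rate after the left join
n_missing <- sum(is.na(left$rate))
```

Here `inner` has two rows and `left` has three, one of which has a missing `rate`. Whether unmatched rows should be dropped or kept depends on the question being asked.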

Part 4: Exploring the General Social Survey (7 points)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. It is conducted biennially (every two years) through in-person interviews using a probability sampling approach. It is one of the most commonly studied datasets in the social science disciplines.

Using the graphical data analysis skills we have reviewed in class, you will conduct an exploratory analysis of the data to identify interesting questions and (potential) answers. Remember the types of questions we seek to answer using GDA:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Are there outliers in the data?
Do I have missingness? Are there patterns to it?
How much variation/error exists in my statistical estimates? Is there a pattern to it?
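For the missingness question, a quick per-column count of missing values is often a useful first look before plotting. The following is a minimal sketch on a hypothetical toy data frame (`survey` and its columns are made up, not taken from the GSS).

```r
library(dplyr)

# toy data with some missing values
survey <- tibble(
  age    = c(25, NA, 40, 31),
  income = c(NA, NA, 52000, 48000),
  region = c("north", "south", "south", "north")
)

# count the NA values in every column
missing_counts <- survey %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
```

`missing_counts` is a one-row data frame with one count per column, which can then be reshaped and plotted to look for patterns.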

What not to do

Build a statistical model

No complex statistical methods should be employed. Focus instead on primarily graphical analysis, though you can also use basic statistical tests you may have learned in other classes (e.g. tests for normality, difference of means).

Adjust for survey weights

Do not worry about using survey weights in your exploratory analysis. Just treat every observation equally.

What you should do

The final submission should include two components.

Lab notebook (4 points)

This is a record of all your exploratory analysis. It should be extensive (minimum 30-40 graphs), and mostly code and graphics.

Minimally annotate your code and output as necessary to keep track of what you’ve done and highlight important insights gained through your exploration
It should be somewhat stream-of-consciousness (that is, a stored record of your exploration as you explore the data), though certainly feel free to maintain a structure or go back and reformat different sections
Don’t bother cleaning up each graph to have meaningful labels

Exploration write-up (3 points)

In a short paper (around 750 words), summarize your insights and what you’ve learned about the data. This could include one or two important research questions you think you could answer using the data, as well as some initial hypotheses supported by your exploratory analysis. Or perhaps you’ve identified unusual variation in a single variable, or extreme outliers or systematic missingness in the data that should be accounted for in future analysis. This component will look different for each student. That’s fine. What I want to see is genuine effort and some thought put into what you’ve learned from this GDA.

This component should include mostly written analysis and a handful of graphs to support your questions and answers
Clean up these graphs so they are publication-ready. This means giving each graph a meaningful title, axis labels, legends, etc.

Accessing the data

You can access this data file in the poliscidata package:

install.packages("poliscidata")
data(gss, package = "poliscidata")

# convert to tibble
library(tidyverse)
gss <- as_tibble(gss)

Dataset documentation

In the documentation folder, there are three files that are potentially relevant to your analysis.

codebook.txt - a codebook of the dataset automatically generated by Stata
GSS_Codebook_index.pdf - a list of all variables available from the GSS, with their variable names in the data file and a brief description of the variable
GSS_Codebook_mainbody.pdf - a detailed description of all variables available from the GSS, with full question wording and potential responses

You can also find more information on the survey and specific variables at the GSS website.

Submit the assignment

Your assignment should be submitted as a set of R Markdown documents. Don’t know what an R Markdown document is? Read this! Or this! I have included starter files for you to modify to complete the assignment, so you are not beginning completely from scratch.

Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

¹ More specifically, the estimated number of people living with HIV per 100 population of age group 15-49.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.

Homework 01/02: Wrangling and exploring data
