library(tidyverse)
library(broom)
library(sf)
library(albersusa)

options(digits = 3)
set.seed(1234)

theme_set(theme_minimal())

\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}\]

Bivariate regression

Consider cigarette consumption in the United States. Here we have a dataset of state-level statistics in the United States from 2008.

states <- haven::read_dta("data/states.dta")

One column cigarettes reports the number of packs of cigarettes purchased bimonthly per adult in each of the 50 US states.

ggplot(states, aes(cigarettes)) +
  geom_histogram() +
  labs(title = "Cigarette consumption in the United States",
       subtitle = "By state",
       x = "Bimonthly packs of cigarettes per adult",
       y = "Number of states")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s say we want to explain cigarette consumption. What factors or independent variables do we think may have an influence on cigarette consumption? One possibility is a cigarette tax.

  • \(H_1\) – In a comparison of states, those having higher taxes on cigarettes have fewer purchases of cigarettes than states with lower taxes on cigarettes.
ggplot(states, aes(cig_tax)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Cigarette taxes in the United States",
       x = "Cigarette tax (per pack)",
       y = "Number of states")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(states, aes(cig_tax, cigarettes)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Cigarette consumption in the United States",
       x = "Cigarette tax (per pack)",
       y = "Bimonthly packs of cigarettes per adult")

cig_tax <- lm(cigarettes ~ cig_tax, data = states)
summary(cig_tax)
## 
## Call:
## lm(formula = cigarettes ~ cig_tax, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.739 -2.095 -0.211  0.986  9.857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.681      0.812   11.93  5.8e-16 ***
## cig_tax       -1.960      0.649   -3.02    0.004 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.94 on 48 degrees of freedom
## Multiple R-squared:  0.16,   Adjusted R-squared:  0.142 
## F-statistic: 9.13 on 1 and 48 DF,  p-value: 0.00403
  • Interpret \(\boldsymbol{\hat{\beta}}\)
  • Generate predicted values
  • Null hypothesis test
  • \(R^2\)

Multivariable regression

Now add a second hypothesis to the model:

  • \(H_2\) – In a comparison of states, Southern states have more purchases of cigarettes than non-Southern states.
# get simple features and join with states
us_sf <- usa_sf("laea") %>%
  left_join(states, by = c("name" = "state")) %>%
  filter(name != "District of Columbia")
## Warning: Column `name`/`state` joining factor and character vector,
## coercing into character vector
## Warning: Column `name`/`state` has different attributes on LHS and RHS of
## join
us_sf %>%
  mutate(cigarettes = cut_number(cigarettes, n = 6)) %>%
  ggplot() +
  geom_sf(aes(fill = cigarettes)) +
  scale_fill_brewer() +
  labs(title = "Cigarette consumption in the United States",
       subtitle = "Bimonthly packs of cigarettes per adult",
       fill = NULL)

south <- lm(cigarettes ~ cig_tax + south, data = states)
summary(south)
## 
## Call:
## lm(formula = cigarettes ~ cig_tax + south, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.407 -1.689 -0.389  0.912  8.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.643      1.065    7.17  4.4e-09 ***
## cig_tax       -0.876      0.726   -1.21   0.2338    
## south          2.730      0.998    2.74   0.0087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.76 on 47 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.244 
## F-statistic: 8.92 on 2 and 47 DF,  p-value: 0.000519
states %>%
  mutate(south = factor(south, levels = 1:0, labels = c("Southern", "Non-Southern"))) %>%
  ggplot(aes(cig_tax, cigarettes, color = south)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  scale_x_continuous(labels = scales::dollar) +
  scale_color_brewer(type = "qual") +
  labs(title = "Cigarette consumption in the United States",
       x = "Cigarette tax (per pack)",
       y = "Bimonthly packs of cigarettes per adult",
       color = NULL) +
  theme(legend.position = "bottom")

  • Interpret \(\boldsymbol{\hat{\beta}}\)
  • Generate predicted values
  • Null hypothesis test
  • \(R^2\)

Estimating linear regression models in R

To estimate linear models in R, we use the lm() function:

south <- lm(cigarettes ~ cig_tax + south, data = states)
summary(south)
## 
## Call:
## lm(formula = cigarettes ~ cig_tax + south, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.407 -1.689 -0.389  0.912  8.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.643      1.065    7.17  4.4e-09 ***
## cig_tax       -0.876      0.726   -1.21   0.2338    
## south          2.730      0.998    2.74   0.0087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.76 on 47 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.244 
## F-statistic: 8.92 on 2 and 47 DF,  p-value: 0.000519

The lm() function takes two parameters. The first is a formula specifying the equation to be estimated (lm() translates y ~ x into \(y = \beta_0 + \beta_1 * x\)). The second is of course the data frame containing the variables.

Note that we have now begun to leave the tidyverse universe. lm() is part of the base R program, and the result of lm() is decidedly not tidy.

str(south)
## List of 12
##  $ coefficients : Named num [1:3] 7.643 -0.876 2.73
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "cig_tax" "south"
##  $ residuals    : Named num [1:50] 0.569 8.839 -1.898 2.529 -0.998 ...
##   ..- attr(*, "label")= chr "Packs bimonthly per adult pop"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:50] -53.57 -8.88 -7.55 1.97 -2.82 ...
##   ..- attr(*, "label")= chr "Packs bimonthly per adult pop"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "names")= chr [1:50] "(Intercept)" "cig_tax" "south" "" ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:50] 9.47 10.11 10.06 9.89 7.26 ...
##   ..- attr(*, "label")= chr "Packs bimonthly per adult pop"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:50, 1:3] -7.071 0.141 0.141 0.141 0.141 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:3] "(Intercept)" "cig_tax" "south"
##   .. ..- attr(*, "assign")= int [1:3] 0 1 2
##   ..$ qraux: num [1:3] 1.14 1.17 1.1
##   ..$ pivot: int [1:3] 1 2 3
##   ..$ tol  : num 1e-07
##   ..$ rank : int 3
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 47
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = cigarettes ~ cig_tax + south, data = states)
##  $ terms        :Classes 'terms', 'formula'  language cigarettes ~ cig_tax + south
##   .. ..- attr(*, "variables")= language list(cigarettes, cig_tax, south)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "cigarettes" "cig_tax" "south"
##   .. .. .. ..$ : chr [1:2] "cig_tax" "south"
##   .. ..- attr(*, "term.labels")= chr [1:2] "cig_tax" "south"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(cigarettes, cig_tax, south)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "cigarettes" "cig_tax" "south"
##  $ model        :'data.frame':   50 obs. of  3 variables:
##   ..$ cigarettes: num [1:50] 10.04 18.95 8.16 12.42 6.26 ...
##   .. ..- attr(*, "label")= chr "Packs bimonthly per adult pop"
##   .. ..- attr(*, "format.stata")= chr "%10.0g"
##   ..$ cig_tax   : num [1:50] 1.03 0.3 0.36 0.55 0.44 2 0.17 0.695 0.18 0.91 ...
##   .. ..- attr(*, "label")= chr "Cigarette tax per pack"
##   .. ..- attr(*, "format.stata")= chr "%10.0g"
##   ..$ south     : 'labelled' num [1:50] 1 1 1 1 0 0 0 0 1 0 ...
##   .. ..- attr(*, "labels")= Named num [1:2] 0 1
##   .. .. ..- attr(*, "names")= chr [1:2] "Nonsouth" "South"
##   .. ..- attr(*, "label")= chr "Southern state?"
##   .. ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language cigarettes ~ cig_tax + south
##   .. .. ..- attr(*, "variables")= language list(cigarettes, cig_tax, south)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "cigarettes" "cig_tax" "south"
##   .. .. .. .. ..$ : chr [1:2] "cig_tax" "south"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "cig_tax" "south"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(cigarettes, cig_tax, south)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "cigarettes" "cig_tax" "south"
##  - attr(*, "class")= chr "lm"

The result is stored in a complex list that contains a lot of important information, some of which you may recognize but most of it you do not. In order to extract model statistics and use them in a tidy manner, we can use a set of functions from the broom package. For these functions, the input is always the model object itself, not the original data frame.

tidy()

tidy() constructs a data frame that summarizes the model’s statistical findings. This includes coefficients and p-values for each parameter in a regression model. Note that depending on the statistical learning method employed, the statistics stored in the columns may vary.

tidy(south)
## # A tibble: 3 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)    7.64      1.07       7.17 0.00000000445
## 2 cig_tax       -0.876     0.726     -1.21 0.234        
## 3 south          2.73      0.998      2.74 0.00874
tidy(south) %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    3 obs. of  5 variables:
##  $ term     : chr  "(Intercept)" "cig_tax" "south"
##  $ estimate : num  7.643 -0.876 2.73
##  $ std.error: num  1.065 0.726 0.998
##  $ statistic: num  7.17 -1.21 2.74
##  $ p.value  : num  4.45e-09 2.34e-01 8.74e-03

Notice that the structure of the resulting object is a tidy data frame. Every row contains a single parameter, every column contains a single statistic, and every cell contains exactly one value.

augment()

augment() adds columns to the original data that was modeled. This could include predictions, residuals, and other observation-level statistics.

augment(south) %>%
  as_tibble()
## # A tibble: 50 x 10
##    cigarettes cig_tax south .fitted .se.fit .resid   .hat .sigma .cooksd
##  *      <dbl>   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1      10.0    1.03  1        9.47   0.768  0.569 0.0775   2.79 1.29e-3
##  2      19.0    0.3   1       10.1    0.716  8.84  0.0674   2.44 2.65e-1
##  3       8.16   0.36  1       10.1    0.706 -1.90  0.0654   2.77 1.18e-2
##  4      12.4    0.55  1        9.89   0.690  2.53  0.0625   2.76 1.99e-2
##  5       6.26   0.44  0        7.26   0.792 -0.998 0.0824   2.78 4.27e-3
##  6       6.22   2     0        5.89   0.687  0.329 0.0620   2.79 3.35e-4
##  7       9.68   0.17  0        7.49   0.956  2.19  0.120    2.77 3.25e-2
##  8       3.58   0.695 0        7.03   0.653 -3.45  0.0560   2.74 3.28e-2
##  9       9.35   0.18  1       10.2    0.744 -0.866 0.0728   2.79 2.78e-3
## 10       4.92   0.91  0        6.85   0.557 -1.93  0.0407   2.77 7.19e-3
## # ... with 40 more rows, and 1 more variable: .std.resid <dbl>

augment() will return statistics to the original data used to estimate the model, however if you supply a data frame under the newdata argument, it will return a more limited set of statistics.

glance()

glance() constructs a concise one-row summary of the model. This typically contains values such as \(R^2\), adjusted \(R^2\), and residual standard error that are computed once for the entire model.

glance(south)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.275         0.244  2.76      8.92 5.19e-4     3  -120.  248.  256.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

While broom may not work with every model in R, it is compatible with a wide range of common statistical models. A full list of models with which broom is compatible can be found on the GitHub page for the package.

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-12-04
## Packages -----------------------------------------------------------------
##  package      * version date       source                             
##  albersusa    * 0.3.1   2018-12-04 Github (hrbrmstr/albersusa@cf69f03)
##  assertthat     0.2.0   2017-04-11 CRAN (R 3.5.0)                     
##  backports      1.1.2   2017-12-13 CRAN (R 3.5.0)                     
##  base         * 3.5.1   2018-07-05 local                              
##  bindr          0.1.1   2018-03-13 CRAN (R 3.5.0)                     
##  bindrcpp     * 0.2.2   2018-03-29 CRAN (R 3.5.0)                     
##  broom        * 0.5.0   2018-07-17 CRAN (R 3.5.0)                     
##  cellranger     1.1.0   2016-07-27 CRAN (R 3.5.0)                     
##  class          7.3-14  2015-08-30 CRAN (R 3.5.1)                     
##  classInt       0.2-3   2018-04-16 CRAN (R 3.5.0)                     
##  cli            1.0.0   2017-11-05 CRAN (R 3.5.0)                     
##  codetools      0.2-15  2016-10-05 CRAN (R 3.5.1)                     
##  colorspace     1.3-2   2016-12-14 CRAN (R 3.5.0)                     
##  compiler       3.5.1   2018-07-05 local                              
##  crayon         1.3.4   2017-09-16 CRAN (R 3.5.0)                     
##  datasets     * 3.5.1   2018-07-05 local                              
##  DBI            1.0.0   2018-05-02 CRAN (R 3.5.0)                     
##  devtools       1.13.6  2018-06-27 CRAN (R 3.5.0)                     
##  digest         0.6.18  2018-10-10 cran (@0.6.18)                     
##  dplyr        * 0.7.8   2018-11-10 cran (@0.7.8)                      
##  e1071          1.7-0   2018-07-28 CRAN (R 3.5.0)                     
##  evaluate       0.11    2018-07-17 CRAN (R 3.5.0)                     
##  forcats      * 0.3.0   2018-02-19 CRAN (R 3.5.0)                     
##  foreign        0.8-71  2018-07-20 CRAN (R 3.5.0)                     
##  ggplot2      * 3.1.0   2018-10-25 cran (@3.1.0)                      
##  glue           1.3.0   2018-07-17 CRAN (R 3.5.0)                     
##  graphics     * 3.5.1   2018-07-05 local                              
##  grDevices    * 3.5.1   2018-07-05 local                              
##  grid           3.5.1   2018-07-05 local                              
##  gtable         0.2.0   2016-02-26 CRAN (R 3.5.0)                     
##  haven          1.1.2   2018-06-27 CRAN (R 3.5.0)                     
##  hms            0.4.2   2018-03-10 CRAN (R 3.5.0)                     
##  htmltools      0.3.6   2017-04-28 CRAN (R 3.5.0)                     
##  httr           1.3.1   2017-08-20 CRAN (R 3.5.0)                     
##  jsonlite       1.5     2017-06-01 CRAN (R 3.5.0)                     
##  knitr          1.20    2018-02-20 CRAN (R 3.5.0)                     
##  labeling       0.3     2014-08-23 CRAN (R 3.5.0)                     
##  lattice        0.20-35 2017-03-25 CRAN (R 3.5.1)                     
##  lazyeval       0.2.1   2017-10-29 CRAN (R 3.5.0)                     
##  lubridate      1.7.4   2018-04-11 CRAN (R 3.5.0)                     
##  magrittr       1.5     2014-11-22 CRAN (R 3.5.0)                     
##  maptools       0.9-4   2018-09-19 cran (@0.9-4)                      
##  memoise        1.1.0   2017-04-21 CRAN (R 3.5.0)                     
##  methods      * 3.5.1   2018-07-05 local                              
##  modelr         0.1.2   2018-05-11 CRAN (R 3.5.0)                     
##  munsell        0.5.0   2018-06-12 CRAN (R 3.5.0)                     
##  nlme           3.1-137 2018-04-07 CRAN (R 3.5.1)                     
##  pillar         1.3.0   2018-07-14 CRAN (R 3.5.0)                     
##  pkgconfig      2.0.2   2018-08-16 CRAN (R 3.5.1)                     
##  plyr           1.8.4   2016-06-08 CRAN (R 3.5.0)                     
##  purrr        * 0.2.5   2018-05-29 CRAN (R 3.5.0)                     
##  R6             2.3.0   2018-10-04 cran (@2.3.0)                      
##  RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.5.0)                     
##  Rcpp           1.0.0   2018-11-07 cran (@1.0.0)                      
##  readr        * 1.1.1   2017-05-16 CRAN (R 3.5.0)                     
##  readxl         1.1.0   2018-04-20 CRAN (R 3.5.0)                     
##  rgdal          1.3-6   2018-10-16 cran (@1.3-6)                      
##  rgeos          0.4-2   2018-11-08 cran (@0.4-2)                      
##  rlang          0.3.0.1 2018-10-25 CRAN (R 3.5.0)                     
##  rmarkdown      1.10    2018-06-11 CRAN (R 3.5.0)                     
##  rprojroot      1.3-2   2018-01-03 CRAN (R 3.5.0)                     
##  rstudioapi     0.7     2017-09-07 CRAN (R 3.5.0)                     
##  rvest          0.3.2   2016-06-17 CRAN (R 3.5.0)                     
##  scales         1.0.0   2018-08-09 CRAN (R 3.5.0)                     
##  sf           * 0.7-1   2018-10-24 CRAN (R 3.5.0)                     
##  sp             1.3-1   2018-06-05 CRAN (R 3.5.0)                     
##  spData         0.2.9.3 2018-08-01 CRAN (R 3.5.0)                     
##  stats        * 3.5.1   2018-07-05 local                              
##  stringi        1.2.4   2018-07-20 CRAN (R 3.5.0)                     
##  stringr      * 1.3.1   2018-05-10 CRAN (R 3.5.0)                     
##  tibble       * 1.4.2   2018-01-22 CRAN (R 3.5.0)                     
##  tidyr        * 0.8.1   2018-05-18 CRAN (R 3.5.0)                     
##  tidyselect     0.2.5   2018-10-11 cran (@0.2.5)                      
##  tidyverse    * 1.2.1   2017-11-14 CRAN (R 3.5.0)                     
##  tools          3.5.1   2018-07-05 local                              
##  units          0.6-0   2018-06-09 CRAN (R 3.5.0)                     
##  utils        * 3.5.1   2018-07-05 local                              
##  withr          2.1.2   2018-03-15 CRAN (R 3.5.0)                     
##  xml2           1.2.0   2018-01-24 CRAN (R 3.5.0)                     
##  yaml           2.2.0   2018-07-25 CRAN (R 3.5.0)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.