library(tidyverse)
library(forcats)
library(broom)
library(modelr)
library(stringr)
library(ISLR)
library(titanic)
library(rcfss)
library(haven)
library(car)
library(lmtest)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())
\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}} \newcommand{\Lagr}{\mathcal{L}} \newcommand{\lagr}{\mathcal{l}}\]
Basic linear regression follows the functional form:
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
where \(Y_i\) is the value of the response variable \(Y\) for the \(i\)th observation, \(X_i\) is the value for the explanatory variable \(X\) for the \(i\)th observation. The coefficients \(\beta_0\) and \(\beta_1\) are population regression coefficients - our goal is to estimate these population parameters given the observed data. \(\epsilon_i\) is the error representing the aggregated omitted causes of \(Y\), other explanatory variables that could be included in the model, measurement error in \(Y\), and any inherently random component of \(Y\).
The key assumptions of linear regression concern the behavior of the errors.
The expectation of the error is 0:
\[\E(\epsilon_i) \equiv \E(\epsilon_i \mid X_i) = 0\]
This allows us to recover the expected value of the response variable as a linear function of the explanatory variable:
\[ \begin{aligned} \mu_i \equiv E(Y_i) \equiv E(Y | X_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + E(\epsilon_i) \\ \mu_i &= \beta_0 + \beta_1 X_i + 0 \\ \mu_i &= \beta_0 + \beta_1 X_i \end{aligned} \]
Because \(\beta_0\) and \(\beta_1\) are fixed parameters in the population, we can remove them from the expectation operator.
The variance of the errors is the same regardless of the values of \(X\):
\[\Var(\epsilon_i | X_i) = \sigma^2\]
The errors are assumed to be normally distributed:
\[\epsilon_i \mid X_i \sim N(0, \sigma^2)\]
Observations are sampled independently from one another: any pair of errors \(\epsilon_i\) and \(\epsilon_j\) are independent for \(i \neq j\). Simple random sampling from a large population will ensure this assumption is met. However, data collection procedures frequently (and explicitly) violate this assumption (e.g. time series data, panel survey data).
\(X\) is assumed to be fixed or measured without error and independent of the error. With a fixed \(X\), the researcher controls the precise value of \(X\) for a given observation (think experimental design with treatment/control groups). In an observational study, we assume \(X\) is measured without error and that the explanatory variable and the error are independent in the population from which the sample is drawn.
\[\epsilon_i \sim N(0, \sigma^2), \text{for } i = 1, \dots, n\]
If \(X\) is fixed, it must vary (i.e. its values cannot all be the same). If \(X\) is random, then \(X\) must vary in the population. You cannot estimate a regression line for an invariant \(X\).
data_frame(x = 1,
           y = rnorm(10)) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "You cannot regress this",
       subtitle = "Slope is undefined")
If these assumptions are violated, conducting inference from linear regression becomes tricky: estimates may be biased, inefficient, and/or error-prone. You could move to a more robust inferential method such as nonparametric regression, decision trees, or support vector machines, but those methods make it harder to generate inference about the explanatory variables. Instead, we can attempt to diagnose assumption violations and impose solutions while still constraining ourselves to a linear regression framework.
Outliers are observations that are somehow unusual, either in their value of \(Y_i\), in one or more \(X_i\)s, or some combination thereof. Outliers have the potential to exert a disproportionate influence on a regression model.
Influence is how much effect a particular observation's values of \(Y\) and \(X\) have on the coefficient estimates. Influence is a function of leverage and discrepancy:
\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]
flintstones <- tribble(
  ~name,    ~x, ~y,
  "Barney", 13, 75,
  "Dino",   24, 300,
  "Betty",  14, 250,
  "Fred",   10, 220,
  "Wilma",  8,  210
)
ggplot(flintstones, aes(x, y, label = name)) +
  geom_smooth(data = filter(flintstones, name %in% c("Wilma", "Fred", "Betty")),
              method = "lm", se = FALSE, fullrange = TRUE,
              aes(linetype = "Betty + Fred + Wilma",
                  color = "Betty + Fred + Wilma")) +
  geom_smooth(data = filter(flintstones, name != "Dino"),
              method = "lm", se = FALSE, fullrange = TRUE,
              aes(linetype = "Barney + Betty + Fred + Wilma",
                  color = "Barney + Betty + Fred + Wilma")) +
  geom_smooth(data = filter(flintstones, name != "Barney"),
              method = "lm", se = FALSE, fullrange = TRUE,
              aes(linetype = "Betty + Dino + Fred + Wilma",
                  color = "Betty + Dino + Fred + Wilma")) +
  scale_linetype_manual(values = c(3, 2, 1)) +
  scale_color_brewer(type = "qual") +
  geom_point(size = 2) +
  ggrepel::geom_label_repel() +
  labs(linetype = NULL,
       color = NULL) +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 3, reverse = TRUE),
         linetype = guide_legend(nrow = 3, reverse = TRUE))
Leverage is typically assessed using the leverage (hat) statistic:
\[h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_{j} - \bar{X})^2}\]
Observations with a leverage statistic noticeably greater than the average leverage \(\bar{h} = \frac{k + 1}{n}\) could have high leverage.
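As a quick check on this formula, we can compute the hat statistic by hand for a simple regression and compare it to the built-in hatvalues() function. This is a minimal sketch; the simulated data and object names are arbitrary and chosen only for illustration:

# hat statistic by hand vs. stats::hatvalues() on arbitrary simulated data
sim_lev <- data_frame(x = runif(20, 0, 10),
                      y = 2 + 3 * x + rnorm(20))
sim_lev_mod <- lm(y ~ x, data = sim_lev)

# h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2)
h_manual <- 1 / nrow(sim_lev) +
  (sim_lev$x - mean(sim_lev$x))^2 / sum((sim_lev$x - mean(sim_lev$x))^2)

all.equal(unname(hatvalues(sim_lev_mod)), h_manual)

# observations exceeding twice the average leverage
which(h_manual > 2 * mean(h_manual))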
Residuals are a natural way to look for discrepant or outlying observations (discrepant observations typically have large residuals, i.e. differences between actual and fitted values of \(Y_i\)). The problem is that the residuals \(\hat{\epsilon}_i\) do not have equal variances, even if the actual errors \(\epsilon_i\) do have equal variances:
\[\Var(\hat{\epsilon}_i) = \sigma^2 (1 - h_i)\]
High leverage observations tend to have small residuals, which makes sense because they pull the regression line towards them. Alternatively, we can calculate a standardized residual which parses out the variability in \(X_i\) from \(\hat{\epsilon}_i\):
\[\hat{\epsilon}_i ' \equiv \frac{\hat{\epsilon}_i}{S_{E} \sqrt{1 - h_i}}\]
where \(S_E\) is the standard error of the regression:
\[S_E = \sqrt{\frac{\sum_i \hat{\epsilon}_i^2}{n - k - 1}}\]
The problem is that the numerator and the denominator are not independent - they both contain \(\hat{\epsilon}_i\), so \(\hat{\epsilon}_i '\) does not follow a \(t\)-distribution. Instead, we can modify this measure by calculating \(S_{E(-i)}\); that is, refit the model deleting the \(i\)th observation and estimate the standard error of the regression \(S_{E(-i)}\) from the remaining \(n - 1\) observations. We then calculate the studentized residual:
\[\hat{\epsilon}_i^{\ast} \equiv \frac{\hat{\epsilon}_i}{S_{E(-i)} \sqrt{1 - h_i}}\]
which now has an independent numerator and denominator and follows a \(t\)-distribution with \(n-k-2\) degrees of freedom. They are on a common scale and we should expect roughly 95% of the studentized residuals to fall within the interval \([-2,2]\).
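In R, standardized and studentized residuals are available directly through rstandard() and rstudent(). A minimal sketch on an arbitrary illustrative model (mtcars is used here purely for convenience; it is not part of the running example):

# standardized vs. studentized residuals for an illustrative model
example_mod <- lm(mpg ~ wt, data = mtcars)

resid_check <- data_frame(standardized = rstandard(example_mod),
                          studentized = rstudent(example_mod))
head(resid_check)

# roughly 95% of studentized residuals should fall within [-2, 2]
mean(abs(resid_check$studentized) <= 2)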
As described previously, influence is a combination of an observation's leverage and discrepancy. In other words, influence is the effect of a particular observation on the coefficient estimates. A simple measure of that influence is the difference between a coefficient estimate with and without the observation in question:
\[D_{ij} = \hat{\beta}_{1j} - \hat{\beta}_{1j(-i)}, \text{for } i = 1, \dots, n \text{ and } j = 0, \dots, k\]
This measure is called \(\text{DFBETA}_{ij}\). Since coefficient estimates are scaled differently depending on how the variables are scaled, we can rescale \(\text{DFBETA}_{ij}\) by the coefficient’s standard error to account for this fact:
\[D^{\ast}_{ij} = \frac{D_{ij}}{SE_{-i}(\beta_{1j})}\]
This measure is called \(\text{DFBETAS}_{ij}\).
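Base R computes both of these quantities directly with dfbeta() and dfbetas(). A minimal sketch, again on an arbitrary illustrative model (refit here so the chunk is self-contained):

# DFBETA: raw change in each coefficient when observation i is deleted
# DFBETAS: the same change, scaled by the deleted-case standard error
example_mod <- lm(mpg ~ wt, data = mtcars)
head(dfbeta(example_mod))
head(dfbetas(example_mod))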
Frequently \(\text{DFBETA}\)s are used to construct summary statistics of each observation's influence on the regression model. Cook's D is based on the theory that one could conduct an \(F\)-test on each observation for the hypothesis that \(\beta_{1j} = \hat{\beta}_{1j(-i)}\, \forall j \in J\). The formula for this measure is:
\[D_i = \frac{\hat{\epsilon}^{'2}_i}{k + 1} \times \frac{h_i}{1 - h_i}\]
where \(\hat{\epsilon}^{'2}_i\) is the squared standardized residual, \(k\) is the number of explanatory variables in the model, and \(h_i\) is the hat value (leverage). We look for values of \(D_i\) that stand out from the rest.
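As a sanity check on the formula, Cook's D can be reproduced from the standardized residuals and hat values and compared against cooks.distance(). A minimal sketch on the same illustrative mtcars model (not part of the running example):

# Cook's D by hand: squared standardized residual over (k + 1), times h / (1 - h)
example_mod <- lm(mpg ~ wt, data = mtcars)
k <- length(coef(example_mod)) - 1  # number of explanatory variables

d_manual <- rstandard(example_mod)^2 / (k + 1) *
  hatvalues(example_mod) / (1 - hatvalues(example_mod))

all.equal(d_manual, cooks.distance(example_mod))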
For example, here are the results of a basic model of the number of federal laws struck down by the U.S. Supreme Court in each Congress, based on the age and tenure of the justices and whether government was unified:
# read in data and estimate model
dahl <- read_dta("data/LittleDahl.dta")
dahl_mod <- lm(nulls ~ age + tenure + unified, data = dahl)
tidy(dahl_mod)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -12.1 2.54 -4.76 0.00000657
## 2 age 0.219 0.0448 4.88 0.00000401
## 3 tenure -0.0669 0.0643 -1.04 0.300
## 4 unified 0.718 0.458 1.57 0.121
A major concern with regression analysis of this data is that the results are being driven by outliers in the data.
dahl <- dahl %>%
  mutate(year = congress * 2 + 1787)

ggplot(dahl, aes(year, nulls)) +
  geom_line() +
  geom_vline(xintercept = 1935, linetype = 2) +
  labs(x = "Year",
       y = "Congressional laws struck down")

ggplot(dahl, aes(year, age)) +
  geom_line() +
  geom_vline(xintercept = 1935, linetype = 2) +
  labs(x = "Year",
       y = "Mean age of justices on the Court")
During the 74th Congress (1935-36), the New Deal/Court-packing crisis was associated with an abnormally large number of laws struck down by the court. We should determine whether or not this observation is driving our results.
By combining leverage, discrepancy, and influence into a “bubble plot”, we can visualize all three statistics simultaneously.
# add key statistics
dahl_augment <- dahl %>%
  mutate(hat = hatvalues(dahl_mod),
         student = rstudent(dahl_mod),
         cooksd = cooks.distance(dahl_mod))

# draw bubble plot
ggplot(dahl_augment, aes(hat, student)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(aes(size = cooksd), shape = 1) +
  geom_text(data = dahl_augment %>%
              arrange(-cooksd) %>%
              slice(1:10),
            aes(label = Congress)) +
  scale_size_continuous(range = c(1, 20)) +
  labs(x = "Leverage",
       y = "Studentized residual") +
  theme(legend.position = "none")
The bubble plot combines all three statistics: leverage on the \(x\) axis, discrepancy (the studentized residual) on the \(y\) axis, and influence (Cook's D) as the size of each bubble. How large does each statistic need to be before we worry?
These are not hard and fast rules rigorously defended by mathematical proofs; they are simply potential rules of thumb to follow when interpreting the above statistics.
For leverage, anything exceeding twice the average \(\bar{h} = \frac{k + 1}{n}\) is noteworthy. In our example that would be the following observations:
dahl_augment %>%
  filter(hat > 2 * mean(hat))
## # A tibble: 9 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1st 1 0 49.8 0.800 1 1789 0.0974 0.330 0.00296
## 2 3rd 3 0 52.8 4.20 0 1793 0.113 0.511 0.00841
## 3 12th 12 0 49 6.60 1 1811 0.0802 0.669 0.00980
## 4 17th 17 0 59 16.6 1 1821 0.0887 -0.253 0.00157
## 5 20th 20 0 61.7 17.4 1 1827 0.0790 -0.577 0.00719
## 6 23rd 23 0 64 18.4 1 1833 0.0819 -0.844 0.0159
## 7 34th 34 0 64 14.6 0 1855 0.0782 -0.561 0.00671
## 8 36th 36 0 68.7 17.8 0 1859 0.102 -1.07 0.0326
## 9 99th 99 3 71.9 16.7 0 1985 0.0912 0.295 0.00221
For discrepancy, any studentized residual outside the range \([-2, 2]\) is potentially discrepant:
dahl_augment %>%
  filter(abs(student) > 2)
## # A tibble: 7 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66 9 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.223
## 3 90th 90 6 64.7 13.3 1 1967 0.0195 2.49 0.0292
## 4 91st 91 6 65.1 13 1 1969 0.0189 2.42 0.0269
## 5 92nd 92 5 62 9.20 1 1971 0.0146 2.05 0.0150
## 6 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.165
## 7 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897
For influence, a common rule of thumb flags observations with \[D_i > \frac{4}{n - k - 1}\] where \(n\) is the number of observations and \(k\) is the number of coefficients in the regression model:
dahl_augment %>%
  filter(cooksd > 4 / (nrow(.) - (length(coef(dahl_mod)) - 1) - 1))
## # A tibble: 4 x 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66 9 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.223
## 3 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.165
## 4 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897
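Rather than computing each diagnostic by hand, broom::augment() attaches .hat, .std.resid, and .cooksd in a single step, and car::influencePlot() draws a bubble plot much like the one above. A minimal sketch (plot output not shown here):

# one-step versions of the diagnostics computed manually above
augment(dahl_mod) %>%
  select(.hat, .std.resid, .cooksd) %>%
  arrange(-.cooksd) %>%
  head()

# built-in bubble plot of leverage, studentized residuals, and Cook's D
influencePlot(dahl_mod)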
If the data is just wrong (miscoded, mismeasured, misentered, etc.), then either fix the error, impute a plausible value for the observation, or omit the offending observation.
If the data for a particular observation is just strange, then you may want to ask “why is it so strange?”
For example, let’s re-estimate the SCOTUS model and omit observations that were commonly identified as outliers:1
dahl_omit <- dahl %>%
  filter(!(congress %in% c(74, 98, 104)))

dahl_omit_mod <- lm(nulls ~ age + tenure + unified, data = dahl_omit)

coefplot::multiplot(dahl_mod, dahl_omit_mod,
                    names = c("All observations",
                              "Omit outliers")) +
  theme(legend.position = "bottom")
# rsquared values
rsquare(dahl_mod, dahl)
## [1] 0.232
rsquare(dahl_omit_mod, dahl_omit)
## [1] 0.258
# rmse values
rmse(dahl_mod, dahl)
## [1] 1.68
rmse(dahl_omit_mod, dahl_omit)
## [1] 1.29
Recall that OLS assumes errors are distributed normally:
\[\epsilon_i | X_i \sim N(0, \sigma^2)\]
However, according to the central limit theorem, inference based on the least-squares estimator is approximately valid under broad conditions.2 So while the validity of the estimates is robust to violating this assumption, their efficiency is not. Recall that efficiency guarantees the smallest possible sampling variance and therefore the smallest possible mean squared error (MSE). Heavy-tailed or skewed error distributions also give rise to outliers (which we just recognized as a problem). Finally, we interpret the least-squares fit as a conditional mean of \(Y\) given \(X\), but arithmetic means are not good measures of the center of a highly skewed distribution.
Graphical methods are the easiest way to detect non-normality in the errors. Consider a regression model using survey data from the 1994 wave of Statistics Canada’s Survey of Labour and Income Dynamics (SLID), explaining hourly wages as an outcome of sex, education, and age:
(slid <- read_tsv("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/SLID-Ontario.txt"))
## Parsed with column specification:
## cols(
## age = col_integer(),
## sex = col_character(),
## compositeHourlyWages = col_double(),
## yearsEducation = col_integer()
## )
## # A tibble: 3,997 x 4
## age sex compositeHourlyWages yearsEducation
## <int> <chr> <dbl> <int>
## 1 40 Male 10.6 15
## 2 19 Male 11 13
## 3 46 Male 17.8 14
## 4 50 Female 14 16
## 5 31 Male 8.2 15
## 6 30 Female 17.0 13
## 7 61 Female 6.7 12
## 8 46 Female 14 14
## 9 43 Male 19.2 18
## 10 17 Male 7.25 11
## # ... with 3,987 more rows
slid_mod <- lm(compositeHourlyWages ~ sex + yearsEducation + age, data = slid)
tidy(slid_mod)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41
## 2 sexMale 3.47 0.207 16.8 4.04e- 61
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
car::qqPlot(slid_mod)
## [1] 967 3911
The above figure is a quantile-comparison plot, graphing for each observation its studentized residual on the \(y\) axis and the corresponding quantile in the \(t\)-distribution on the \(x\) axis. The dashed lines indicate 95% confidence intervals calculated under the assumption that the errors are normally distributed. If any observations fall outside this range, this is an indication that the assumption has been violated. Clearly, here that is the case.
augment(slid_mod, slid) %>%
  mutate(.student = rstudent(slid_mod)) %>%
  ggplot(aes(.student)) +
  geom_density(adjust = .5) +
  labs(x = "Studentized residuals",
       y = "Estimated density")
From the density plot of the studentized residuals, we can also see that the residuals are positively skewed.
Power and log transformations are typically used to correct this problem. Here, trial and error reveals that by log transforming the wage variable, the distribution of the residuals becomes much more symmetric:
slid <- slid %>%
  mutate(wage_log = log(compositeHourlyWages))

slid_log_mod <- lm(wage_log ~ sex + yearsEducation + age, data = slid)
tidy(slid_log_mod)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.10 0.0380 28.9 1.97e-167
## 2 sexMale 0.224 0.0131 17.1 2.16e- 63
## 3 yearsEducation 0.0559 0.00217 25.7 2.95e-135
## 4 age 0.0182 0.000549 33.1 4.50e-212
car::qqPlot(slid_log_mod)
## [1] 2760 3267
augment(slid_log_mod, slid) %>%
  mutate(.student = rstudent(slid_log_mod)) %>%
  ggplot(aes(.student)) +
  geom_density(adjust = .5) +
  labs(x = "Studentized residuals",
       y = "Estimated density")
Recall that linear regression assumes the error terms \(\epsilon_i\) have a constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). This is called homoscedasticity. Remember that the standard errors directly rely upon the estimate of this value:
\[\widehat{\se}(\hat{\beta}_{1j}) = \sqrt{\hat{\sigma}^{2} (X'X)^{-1}_{jj}}\]
If the variances of the error terms are non-constant (aka heteroscedastic), our estimates of the parameters \(\hat{\beta}_1\) will still be unbiased because they do not depend on \(\sigma^2\). However our estimates of the standard errors will be inaccurate - they will either be inflated or deflated, leading to incorrect inferences about the statistical significance of predictor variables.
We can uncover homo- or heteroscedasticity through the use of the residual plot. Below is data generated from the process:
\[Y_i = 2 + 3X_i + \epsilon_i\]
where \(\epsilon_i\) is random error distributed normally \(N(0,1)\).
sim_homo <- data_frame(x = runif(1000, 0, 10),
                       y = 2 + 3 * x + rnorm(1000, 0, 1))
sim_homo_mod <- glm(y ~ x, data = sim_homo)

sim_homo %>%
  add_predictions(sim_homo_mod) %>%
  add_residuals(sim_homo_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Homoscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 5)
Compare this to a linear model fit to the data generating process:
\[Y_i = 2 + 3X_i + \epsilon_i\]
where \(\epsilon_i\) is random error distributed normally with mean 0 and standard deviation \(\frac{X_i}{2}\). Note that the variance of the error term for each observation \(\epsilon_i\) is not constant, and is itself a function of \(X\).
sim_hetero <- data_frame(x = runif(1000, 0, 10),
                         y = 2 + 3 * x + rnorm(1000, 0, (x / 2)))
sim_hetero_mod <- glm(y ~ x, data = sim_hetero)

sim_hetero %>%
  add_predictions(sim_hetero_mod) %>%
  add_residuals(sim_hetero_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Heteroscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 5)
We see a distinct funnel shape in the relationship between the predicted values and the residuals: as \(X_i\) increases the residuals spread further from zero, so a model that assumes constant variance substantially over- or underestimates the actual response \(Y_i\) at larger values of \(X_i\).
There are also formal statistical tests for heteroscedasticity. One such test is the Breusch-Pagan test: regress the squared residuals from the original model on the explanatory variables and test whether they jointly explain a significant share of the variation in the squared residuals.
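A minimal hand-rolled sketch of the studentized form of this test (which is what bptest() reports by default) on the simulated heteroscedastic data; it should agree closely with the packaged version shown below:

# studentized Breusch-Pagan by hand: n * R^2 from regressing the squared
# residuals on the explanatory variable, compared to a chi-squared distribution
bp_aux <- lm(residuals(sim_hetero_mod)^2 ~ x, data = sim_hetero)
bp_stat <- nrow(sim_hetero) * summary(bp_aux)$r.squared

bp_stat
pchisq(bp_stat, df = 1, lower.tail = FALSE)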
The lmtest library contains a function for the Breusch-Pagan test:
bptest(sim_homo_mod)
##
## studentized Breusch-Pagan test
##
## data: sim_homo_mod
## BP = 2, df = 1, p-value = 0.1
bptest(sim_hetero_mod)
##
## studentized Breusch-Pagan test
##
## data: sim_hetero_mod
## BP = 100, df = 1, p-value <2e-16
Instead of assuming the errors have a constant variance \(\text{Var}(\epsilon_i) = \sigma^2\), we can assume that the errors are independent and normally distributed with mean zero and different variances \(\epsilon_i \sim N(0, \sigma_i^2)\):
\[ \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \sigma_n^2 \\ \end{bmatrix} \]
We can define the reciprocal of each variance \(\sigma_i^2\) as the weight \(w_i = \frac{1}{\sigma_i^2}\), then let matrix \(\mathbf{W}\) be a diagonal matrix containing these weights:
\[ \mathbf{W} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_n^2} \\ \end{bmatrix} \]
So rather than following the traditional linear regression estimator
\[\hat{\mathbf{\beta_1}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\]
we can substitute in the weighting matrix \(\mathbf{W}\):
\[\hat{\mathbf{\beta_1}} = (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{y}\]
The error variance is then estimated as
\[\hat{\sigma}^2 = \frac{\sum_i w_i \hat{\epsilon}_i^2}{n}\]
This is equivalent to minimizing the weighted sum of squares, according greater weight to observations with smaller variance.
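To see the weighted estimator in action, here is a minimal matrix-algebra sketch on the simulated heteroscedastic data, where the true error variances (and hence the weights) are known by construction; it should match lm() with a weights argument:

# WLS by matrix algebra with known weights w_i = 1 / Var(e_i) = 1 / (x_i / 2)^2
X <- model.matrix(~ x, data = sim_hetero)
w <- 1 / (sim_hetero$x / 2)^2
W <- diag(w)

beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% sim_hetero$y
beta_wls

# same estimates via lm()'s weights argument
coef(lm(y ~ x, data = sim_hetero, weights = w))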
In practice the weights are unknown, so how do we estimate \(w_i\)? One approach uses the residuals from an initial OLS fit, setting \(w_i = 1 / \hat{\epsilon}_i^2\); another models the error variance as a function of relevant explanatory variables. For example, using the first approach on our original SLID model:
# do we have heteroscedasticity?
slid_mod <- glm(compositeHourlyWages ~ sex + yearsEducation + age, data = slid)

slid %>%
  add_predictions(slid_mod) %>%
  add_residuals(slid_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Heteroscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 5)
# convert residuals to weights
weights <- 1 / residuals(slid_mod)^2
slid_wls <- lm(compositeHourlyWages ~ sex + yearsEducation + age, data = slid, weights = weights)
tidy(slid_mod)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41
## 2 sexMale 3.47 0.207 16.8 4.04e- 61
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
tidy(slid_wls)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.10 0.0160 -506. 0
## 2 sexMale 3.47 0.00977 356. 0
## 3 yearsEducation 0.927 0.00142 653. 0
## 4 age 0.261 0.000170 1534. 0
We see some mild changes in the estimated parameters, but drastic reductions in the standard errors. This reduction is likely overstated: the weights were estimated from the residuals of the original fit, and the resulting covariance matrix does not reflect that additional source of uncertainty. Instead, it would be better to model the weights as a function of relevant explanatory variables.3
Alternatively, we could correct for heteroscedasticity only in the standard error estimates. This produces the same estimated parameters, but adjusts the standard errors to account for the violation of the constant error variance assumption (so we won’t falsely believe our estimates are more precise than they really are). One major estimation procedure is Huber-White standard errors (also called robust standard errors), which can be recovered using the car::hccm() function:4
hw_std_err <- hccm(slid_mod, type = "hc1") %>%
  diag %>%
  sqrt

tidy(slid_mod) %>%
  mutate(std.error.rob = hw_std_err)
## # A tibble: 4 x 6
## term estimate std.error statistic p.value std.error.rob
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.12 0.599 -13.6 5.27e- 41 0.636
## 2 sexMale 3.47 0.207 16.8 4.04e- 61 0.207
## 3 yearsEducation 0.930 0.0343 27.1 5.47e-149 0.0385
## 4 age 0.261 0.00866 30.2 3.42e-180 0.00881
Notice that these new standard errors are a bit larger than those from the original model, accounting for the increased uncertainty of our parameter estimates due to heteroscedasticity.
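The same correction is available through the sandwich package, which pairs with lmtest::coeftest() to produce a full coefficient table with robust standard errors. A minimal sketch assuming sandwich is installed; the numbers should be very close to the hccm() calculation above:

# HC1 (Huber-White) standard errors via the sandwich estimator
coeftest(slid_mod, vcov. = sandwich::vcovHC(slid_mod, type = "HC1"))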
Assuming the average error \(\E (\epsilon_i)\) is 0 everywhere implies that the regression line (or surface) accurately reflects the relationship between \(X\) and \(Y\). Violating this assumption means that the model fails to capture the systematic relationship between the response and explanatory variables, so the term nonlinearity here is broader than a simple curved relationship between \(X\) and \(Y\).
Detecting nonlinearity can be tricky in higher-dimensional regression models with multiple explanatory variables.
Define the partial residual for the \(j\)th explanatory variable:
\[\hat{\epsilon}_i^{(j)} = \hat{\epsilon}_i + \hat{\beta}_j X_{ij}\]
In essence, calculate the least-squares residual (\(\hat{\epsilon}_i\)) and add to it the linear component of the partial relationship between \(Y\) and \(X_j\). Finally, we can plot \(X_j\) versus \(\hat{\epsilon}_i^{(j)}\) and assess the relationship. For instance, consider the results of the logged wage model from earlier:
# get partial resids
slid_resid <- residuals(slid_log_mod, type = "partial") %>%
  as_tibble()
names(slid_resid) <- str_c(names(slid_resid), "_resid")

slid_diag <- augment(slid_log_mod, slid) %>%
  bind_cols(slid_resid)
ggplot(slid_diag, aes(age, age_resid)) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linetype = 2) +
  labs(x = "Age",
       y = "Partial residual for age")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(slid_diag, aes(yearsEducation, yearsEducation_resid)) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linetype = 2) +
  labs(x = "Education (years)",
       y = "Partial residual for education")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The solid lines are GAM fits, while the dashed lines are linear least-squares fits. For age, the partial relationship with logged wages is not linear; some transformation of age is necessary to correct this. For education, the relationship is approximately linear except for the discrepancy among individuals with very low education levels.
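The car package can draw these component-plus-residual (partial residual) plots in a single call, which is a useful cross-check on the manual construction above:

# component + residual plots for every term in the model
crPlots(slid_log_mod)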
We can correct this by adding a squared polynomial term for age (keeping the linear term) and replacing the education term with its square. The resulting regression model is:
\[\log(\text{Wage}) = \beta_0 + \beta_1(\text{Male}) + \beta_2 \text{Age} + \beta_3 \text{Age}^2 + \beta_4 \text{Education}^2\]
slid_log_trans <- lm(wage_log ~ sex + I(yearsEducation^2) + age + I(age^2), data = slid)
tidy(slid_log_trans)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.397 0.0578 6.87 7.62e- 12
## 2 sexMale 0.221 0.0124 17.8 3.21e- 68
## 3 I(yearsEducation^2) 0.00181 0.0000786 23.0 1.19e-109
## 4 age 0.0830 0.00319 26.0 2.93e-138
## 5 I(age^2) -0.000852 0.0000410 -20.8 3.85e- 91
Because the model is now nonlinear in both age and education, we need to rethink how to draw the partial residuals plot. The easiest approach is to plot the partial residuals for both age and education against the original explanatory variable. For age, that is
\[\hat{\epsilon}_i^{\text{Age}} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i + \hat{\epsilon}_i\]
and for education,
\[\hat{\epsilon}_i^{\text{Education}} = 0.002 \times \text{Education}^2_i + \hat{\epsilon}_i\]
On the same graph, we also plot the partial fits for the two explanatory variables:
\[\hat{Y}_i^{(\text{Age})} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i\]
and for education,
\[\hat{Y}_i^{(\text{Education})} = 0.002 \times \text{Education}^2_i\]
On the graphs, the solid lines represent the partial fits and the dashed lines represent the partial residuals. If the two lines overlap significantly, then the revised model does a good job accounting for the nonlinearity.
# get partial resids
slid_trans_resid <- residuals(slid_log_trans, type = "partial") %>%
  as_tibble()
names(slid_trans_resid) <- c("sex", "education", "age", "age_sq")
names(slid_trans_resid) <- str_c(names(slid_trans_resid), "_resid")

# partial fits: coefficients 3-5 are I(yearsEducation^2), age, and I(age^2)
slid_trans_diag <- augment(slid_log_trans, slid) %>%
  as_tibble() %>%
  mutate(age_resid = coef(slid_log_trans)[[4]] * age +
           coef(slid_log_trans)[[5]] * age^2,
         educ_resid = coef(slid_log_trans)[[3]] * yearsEducation^2)
ggplot(slid_trans_diag, aes(age, age_resid + .resid)) +
  geom_point(alpha = .1) +
  geom_smooth(aes(y = age_resid), se = FALSE) +
  geom_smooth(se = FALSE, linetype = 2) +
  labs(x = "Age",
       y = "Partial residual for age")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(slid_trans_diag, aes(yearsEducation, educ_resid + .resid)) +
  geom_point(alpha = .1) +
  geom_smooth(aes(y = educ_resid), se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linetype = 2) +
  labs(x = "Education (years)",
       y = "Partial residual for education")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Collinearity (or multicollinearity) is a state of a model where explanatory variables are correlated with one another.
Perfect collinearity is incredibly rare, and typically involves using transformed versions of a variable in the model along with the original variable. For example, let’s estimate a regression model explaining mpg as a function of disp, wt, and cyl:
mtcars1 <- lm(mpg ~ disp + wt + cyl, data = mtcars)
summary(mtcars1)
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
Now let’s say we want to recode disp so it is centered around its mean and re-estimate the model:
mtcars <- mtcars %>%
  mutate(disp_mean = disp - mean(disp))

mtcars2 <- lm(mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
summary(mtcars2)
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## disp_mean NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
Oops. What’s the problem? disp and disp_mean are perfectly correlated with each other:
ggplot(mtcars, aes(disp, disp_mean)) +
  geom_point()
Because they perfectly explain each other, we cannot estimate a linear regression model that contains both variables.5 Fortunately R automatically drops the second variable so it can estimate the model. Because of this, perfect multicollinearity is rarely problematic in social science.
Instead, consider the credit dataset:
credit <- read_csv("data/Credit.csv") %>%
  select(-X1)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## Income = col_double(),
## Limit = col_integer(),
## Rating = col_integer(),
## Cards = col_integer(),
## Age = col_integer(),
## Education = col_integer(),
## Gender = col_character(),
## Student = col_character(),
## Married = col_character(),
## Ethnicity = col_character(),
## Balance = col_integer()
## )
names(credit) <- tolower(names(credit))
ggplot(credit, aes(limit, age)) +
  geom_point()
Age and limit are not strongly correlated with one another, so estimating a linear regression model to predict an individual’s balance as a function of age and limit is not a problem:
age_limit <- lm(balance ~ age + limit, data = credit)
tidy(age_limit)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -173. 43.8 -3.96 9.01e- 5
## 2 age -2.29 0.672 -3.41 7.23e- 4
## 3 limit 0.173 0.00503 34.5 1.63e-121
But what about using an individual’s credit card rating instead of age? It is likely a good predictor of balance as well:
ggplot(credit, aes(rating, balance)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
limit_rate <- lm(balance ~ limit + rating, data = credit)
tidy(limit_rate)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -378. 45.3 -8.34 1.21e-15
## 2 limit 0.0245 0.0638 0.384 7.01e- 1
## 3 rating 2.20 0.952 2.31 2.13e- 2
coefplot::multiplot(age_limit, limit_rate)
By replacing age with rating, we developed a problem in our model. The problem is that limit and rating are strongly correlated with one another:
ggplot(credit, aes(limit, rating)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
coefplot::multiplot(age_limit, limit_rate, predictors = "limit")
In the regression model, it is difficult to parse out the independent effects of limit and rating on balance, because limit and rating tend to increase and decrease in association with one another. This collinearity reduces the precision of our parameter estimates, inflating the standard errors. This is why you can see above that the standard error for limit is much larger in the second model compared to the first model.
A correlation or scatterplot matrix would help to reveal any strongly correlated variables:
library(GGally)
ggcorr(select_if(credit, is.numeric))
ggpairs(select_if(credit, is.numeric))
Here it is very clear that limit and rating are strongly correlated with one another.
Unfortunately, correlation matrices may not be sufficient to detect collinearity if the correlation exists between three or more variables (aka multicollinearity) while not existing between any two pairs of these variables. Instead, we can calculate the variance inflation factor (VIF), which is the ratio of the variance of \(\hat{\beta}_{1j}\) when fitting the full model divided by the variance of \(\hat{\beta}_{1j}\) if fit on its own. We can use the car::vif() function in R to calculate this statistic for each coefficient. A good rule of thumb is that a VIF statistic greater than 10 indicates potential multicollinearity in the model. Applied to the credit regression models above:
vif(age_limit)
## age limit
## 1.01 1.01
vif(limit_rate)
## limit rating
## 160 160
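As a check on what the VIF measures, it can also be computed by hand as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing the \(j\)th explanatory variable on the others; a minimal sketch for limit:

# VIF for limit by hand: regress limit on the other predictor and use 1 / (1 - R^2)
r2_limit <- summary(lm(limit ~ rating, data = credit))$r.squared
1 / (1 - r2_limit)  # should match vif(limit_rate)["limit"]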
Drop one or more of the collinear variables from the model
This is not a good idea, even if it makes your results “significant”. By omitting the variable, you are completely re-specifying your model in direct contradiction to your theory. If your theory suggests that a variable can be dropped, go ahead. But if not, then don’t do it.
The more observations, the better. It could at least decrease your standard errors and give you more precise estimates. And if you add “odd” or unusual observations, it could also reduce the degree of multicollinearity.
If the variables are indicators of the same underlying concept, you can combine them into an index variable. This could be an additive index where you sum up comparable covariates or binary indicators. Alternatively, you could create an index via principal components analysis.
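A minimal sketch of the principal components approach, replacing the two collinear predictors with their first principal component (purely illustrative; the index name is arbitrary):

# first principal component of limit and rating as a single index variable
credit_pca <- prcomp(select(credit, limit, rating), scale. = TRUE)

credit %>%
  mutate(limit_rating_index = credit_pca$x[, 1]) %>%
  lm(balance ~ limit_rating_index, data = .) %>%
  tidy()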
Shrinkage methods fit a model containing all \(p\) predictors but shrink the estimated coefficients towards zero, which reduces their variance. When multicollinearity is high, the variance of the estimator \(\hat{\beta}_1\) is also high. By shrinking the estimated coefficients towards zero, we may accept some bias in exchange for smaller variance in our estimates.
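A minimal sketch of a ridge fit on the credit data, assuming the glmnet package is installed; the point is only to illustrate the mechanics of shrinkage, not to tune a final model:

# ridge regression (alpha = 0) shrinks the correlated limit and rating coefficients
library(glmnet)

credit_x <- model.matrix(balance ~ limit + rating, data = credit)[, -1]
credit_y <- credit$balance

credit_ridge <- cv.glmnet(credit_x, credit_y, alpha = 0)
coef(credit_ridge, s = "lambda.min")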
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.5.1 (2018-07-02)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-11-29
## Packages -----------------------------------------------------------------
## package * version date source
## abind 1.4-5 2016-07-21 CRAN (R 3.5.0)
## assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
## base * 3.5.1 2018-07-05 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
## bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
## broom * 0.5.0 2018-07-17 CRAN (R 3.5.0)
## car * 3.0-0 2018-04-02 CRAN (R 3.5.0)
## carData * 3.0-1 2018-03-28 CRAN (R 3.5.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.5.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
## compiler 3.5.1 2018-07-05 local
## crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
## curl 3.2 2018-03-28 CRAN (R 3.5.0)
## data.table 1.11.4 2018-05-27 CRAN (R 3.5.0)
## datasets * 3.5.1 2018-07-05 local
## devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
## digest 0.6.18 2018-10-10 cran (@0.6.18)
## dplyr * 0.7.8 2018-11-10 cran (@0.7.8)
## evaluate 0.11 2018-07-17 CRAN (R 3.5.0)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.5.0)
## foreign 0.8-71 2018-07-20 CRAN (R 3.5.0)
## ggplot2 * 3.1.0 2018-10-25 cran (@3.1.0)
## glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
## graphics * 3.5.1 2018-07-05 local
## grDevices * 3.5.1 2018-07-05 local
## grid 3.5.1 2018-07-05 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
## haven * 1.1.2 2018-06-27 CRAN (R 3.5.0)
## hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
## ISLR * 1.2 2017-10-20 CRAN (R 3.5.0)
## jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
## knitr 1.20 2018-02-20 CRAN (R 3.5.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
## lmtest * 0.9-36 2018-04-04 CRAN (R 3.5.0)
## lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
## methods * 3.5.1 2018-07-05 local
## modelr * 0.1.2 2018-05-11 CRAN (R 3.5.0)
## munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
## nlme 3.1-137 2018-04-07 CRAN (R 3.5.1)
## openxlsx 4.1.0 2018-05-26 CRAN (R 3.5.0)
## pillar 1.3.0 2018-07-14 CRAN (R 3.5.0)
## pkgconfig 2.0.2 2018-08-16 CRAN (R 3.5.1)
## plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
## purrr * 0.2.5 2018-05-29 CRAN (R 3.5.0)
## R6 2.3.0 2018-10-04 cran (@2.3.0)
## rcfss * 0.1.5 2018-05-30 local
## Rcpp 1.0.0 2018-11-07 cran (@1.0.0)
## readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
## readxl 1.1.0 2018-04-20 CRAN (R 3.5.0)
## rio 0.5.10 2018-03-29 CRAN (R 3.5.0)
## rlang 0.3.0.1 2018-10-25 CRAN (R 3.5.0)
## rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.5.0)
## scales 1.0.0 2018-08-09 CRAN (R 3.5.0)
## stats * 3.5.1 2018-07-05 local
## stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
## stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.5.0)
## tidyr * 0.8.1 2018-05-18 CRAN (R 3.5.0)
## tidyselect 0.2.5 2018-10-11 cran (@0.2.5)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.5.0)
## titanic * 0.1.0 2015-08-31 CRAN (R 3.5.0)
## tools 3.5.1 2018-07-05 local
## utils * 3.5.1 2018-07-05 local
## withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
## xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
## yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
## zip 1.0.0 2017-04-25 CRAN (R 3.5.0)
## zoo * 1.8-3 2018-07-16 CRAN (R 3.5.0)
1. 74th (1935-36), 98th (1983-84), and 104th (1995-96).
2. Assuming the sample size is sufficiently large.
3. Omitted here for time purposes. You can find details on this estimation procedure on the internet.
4. This function really returns the “sandwich” estimator of the variance-covariance matrix, so we need to further take the square root of the diagonal of this matrix.
5. Basically we cannot invert the variance-covariance matrix of \(\mathbf{X}\) because the collinear columns in \(\mathbf{X}\) are perfectly linearly dependent on each other. Because of this, we cannot get parameter estimates or standard errors for the model.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.