Inference with Heteroskedasticity

Note on required packages: The following code requires the packages lmtest, sandwich, and tidyverse. The sandwich package contains procedures to estimate regression error variances that may change with the explanatory variables. The lmtest package contains procedures to conduct hypothesis tests in the presence of heteroskedasticity. If you have not done so, download and install the packages lmtest, sandwich, and tidyverse. Once the packages are installed, load these libraries.
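A setup chunk along these lines accomplishes this (the install.packages() line only needs to be run once per machine):

```r
# Install once if you have not already:
# install.packages(c("lmtest", "sandwich", "tidyverse"))

library(lmtest)     # hypothesis tests robust to heteroskedasticity
library(sandwich)   # heteroskedasticity-consistent variance estimators
library(tidyverse)  # data import and manipulation
```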



1 Introduction

Homoskedasticity is the property that the variance of the error term of a regression (estimated by the variance of the residual in the sample) is the same across different values for the explanatory variables, or the same across time for time series models.

Heteroskedasticity is the property that the variance of the error term changes predictably with one or more of the explanatory variables.

2 Example: Factors affecting monthly earnings

Let us examine a data set that explores the relationship between total monthly earnings (MonthlyEarnings) and a number of factors that may influence monthly earnings, including each person’s IQ (IQ), a measure of knowledge of the workplace environment (Knowledge), years of education (YearsEdu), years of experience (YearsExperience), and years at current job (Tenure).

The code below downloads a CSV file that includes data on the above variables from 1980 for 663 individuals and assigns it to a data set called df.
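A call along the following lines performs the download; the file location shown is a placeholder, since the tutorial’s actual URL is not reproduced here:

```r
# Placeholder path: substitute the CSV location given in the tutorial
df <- read_csv("wage_data.csv")
```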

The following call to lm() estimates a multiple regression predicting monthly earnings based on the five explanatory variables given above. The call to summary() displays some summary statistics from the regression.
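Assuming the fitted model is stored under the (hypothetical) name lmwages, the chunk would look like:

```r
# Regress monthly earnings on the five explanatory variables
lmwages <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu +
                YearsExperience + Tenure,
              data = df)
summary(lmwages)
```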

3 Heteroskedasticity robust standard errors

White’s correction for heteroskedasticity is a method for keeping the OLS estimates for the coefficients and predicted values (which are unbiased), while fixing the estimates for the variances of the coefficients (which are biased under heteroskedasticity). It uses the squared residuals from the OLS estimation, which we computed and graphed above, as an estimate for the possibly changing variance.

In R, we can compute the White heteroskedastic variance/covariance matrix for the coefficients with the call below to vcovHC from the sandwich package. The name vcovHC stands for Variance/Covariance, Heteroskedasticity-Consistent.
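Assuming the regression output from above is stored as lmwages (a name used here for illustration), the call is:

```r
# White heteroskedasticity-consistent variance/covariance matrix
vcv <- vcovHC(lmwages, type = "HC1")
```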

The first parameter in the call above is our original output from our call to lm(), and the second parameter, type="HC1", tells the function to use the White estimate for the variance/covariance matrix, which uses the squared residuals from the OLS estimation as an estimate for the changing variance.

4 Hypothesis testing

We can use our estimate for the variance / covariance to properly compute our standard errors, t-statistics, and p-values for the coefficients:
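With the robust variance/covariance matrix stored as vcv (computed with vcovHC() as described above, and the model stored as lmwages), the call to coeftest() from the lmtest package looks like:

```r
# t tests on the coefficients using the robust variance/covariance matrix
coeftest(lmwages, vcov = vcv)
```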

## 
## t test of coefficients:
## 
##                  Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)     -512.5397   135.5678 -3.7807 0.0001706 ***
## IQ                 3.8887     1.1352  3.4256 0.0006517 ***
## Knowledge          8.0476     2.4932  3.2278 0.0013093 ** 
## YearsEdu          46.3084     8.6877  5.3303 1.349e-07 ***
## YearsExperience   13.4454     4.1497  3.2401 0.0012550 ** 
## Tenure             3.3915     3.0319  1.1186 0.2637138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first parameter to coeftest is the result of our original call to lm() and the second parameter is the updated variance / covariance matrix to use.

Let’s compare the result to the OLS regression output:
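The OLS output comes from the usual summary() call (again assuming the model object is named lmwages):

```r
summary(lmwages)
```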

## 
## Call:
## lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + 
##     Tenure, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -849.51 -244.91  -41.28  191.41 2225.88 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -512.540    139.180  -3.683 0.000250 ***
## IQ                 3.889      1.204   3.230 0.001299 ** 
## Knowledge          8.048      2.246   3.582 0.000366 ***
## YearsEdu          46.308      8.833   5.243 2.13e-07 ***
## YearsExperience   13.445      4.064   3.309 0.000989 ***
## Tenure             3.392      3.016   1.124 0.261262    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.6 on 657 degrees of freedom
## Multiple R-squared:  0.1796, Adjusted R-squared:  0.1733 
## F-statistic: 28.76 on 5 and 657 DF,  p-value: < 2.2e-16

The coefficients are exactly the same, but the estimates for the standard errors, the t-statistics, and p-values are slightly different.

One reason the standard errors and p-values are only slightly different is that there may actually be very little heteroskedasticity. The differences in the variance of the error term may be small, even for large differences in the values of the explanatory variables.

The White heteroskedastic robust standard errors are valid for either homoskedasticity or heteroskedasticity, so it is always safe to use these estimates if you are not sure if the homoskedasticity assumption holds.

5 Confidence Intervals

The function confint() computes confidence intervals for the coefficients, but assumes homoskedasticity. Here is an example:
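Assuming the model object is named lmwages, the call is:

```r
# Homoskedasticity-based 95% confidence intervals
confint(lmwages, level = 0.95)
```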

##                       2.5 %      97.5 %
## (Intercept)     -785.831826 -239.247619
## IQ                 1.524802    6.252556
## Knowledge          3.636458   12.458672
## YearsEdu          28.964873   63.651868
## YearsExperience    5.465652   21.425242
## Tenure            -2.531316    9.314351

Unfortunately, R includes no similar built-in method to compute confidence intervals for the coefficients with heteroskedasticity-robust standard errors. You can, however, compute the confidence intervals manually based on the estimates of the variances from the vcovHC() function. I created such a function, which you can download and load into memory using the following call:

We now have a function in our environment called confintHC(). We use this function in the code below to estimate 95% confidence intervals on the coefficients using the White heteroskedasticity estimates for the variances.
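The source() call that downloads the original function is not reproduced here. As a rough illustration only, a function of this kind could be sketched as follows, combining robust standard errors from a supplied variance/covariance matrix with critical values from the t distribution; the implementation details of the author’s actual confintHC() may differ:

```r
# Hypothetical sketch of a confintHC()-style function: confidence
# intervals from a user-supplied variance/covariance matrix.
confintHC <- function(model, vcv, level = 0.95) {
  coefs <- coef(model)
  se    <- sqrt(diag(vcv))                            # robust standard errors
  tcrit <- qt(1 - (1 - level) / 2, df = model$df.residual)
  data.frame(Coefficient   = coefs,
             `Lower Limit` = coefs - tcrit * se,
             `Upper Limit` = coefs + tcrit * se,
             check.names   = FALSE)
}

# 95% confidence intervals using the White estimate of the variances
confintHC(lmwages, vcv = vcovHC(lmwages, type = "HC1"))
```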

##                 Coefficient  2.5% Limit 97.5% Limit
## (Intercept)     -512.539723 -778.738116 -246.341329
## IQ                 3.888679    1.659646    6.117712
## Knowledge          8.047565    3.152031   12.943099
## YearsEdu          46.308370   29.249347   63.367394
## YearsExperience   13.445447    5.297253   21.593641
## Tenure             3.391518   -2.561851    9.344887

6 Joint Test for Multiple Restrictions

Let us allow for interaction effects between IQ and years of experience and between IQ and years of education. The interaction effect with years of experience allows for the possibility that people with a higher IQ earn a greater return to experience, possibly because people with a higher IQ learn more with each additional year of experience. The interaction effect with years of education allows for the possibility that people with a higher IQ get greater benefits from each additional year of education.
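A model with these two interaction terms can be estimated as follows (the object name lmInteract is an assumption used for illustration; the formula matches the unrestricted model in the test output below):

```r
# Unrestricted model with IQ interaction terms
lmInteract <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu +
                   YearsExperience + Tenure +
                   IQ:YearsExperience + IQ:YearsEdu,
                 data = df)
```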

Now suppose we want to test whether or not IQ has any influence on monthly earnings. The variable IQ appears multiple times: once on its own and twice in interaction effects. The null and alternative hypotheses for this test are given by,

\[ H_0: \beta_{IQ} = 0,~~ \beta_{IQ \times YearsEdu} = 0,~~ \beta_{IQ \times YearsExperience} = 0 \] \[ H_A: \mbox{At least one of these is not equal to zero} \]

The standard F-test for multiple restrictions, which compares the sum of squared residuals between an unrestricted and a (nested) restricted model (often called the Wald test), assumes homoskedasticity, and so this test is not appropriate if heteroskedasticity is present.

There is a heteroskedasticity-robust version of the test, which can be computed with the waldtest() function from the lmtest package. The default output of waldtest() is the same as anova(), but it has the additional flexibility that we can specify the correct variance/covariance matrix.

First we estimate the restricted model, which does not include IQ in any of the terms:
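The restricted model drops IQ and both interaction terms (the object name lmRestricted is used here for illustration):

```r
# Restricted model: no IQ terms at all
lmRestricted <- lm(MonthlyEarnings ~ Knowledge + YearsEdu +
                     YearsExperience + Tenure,
                   data = df)
```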

Then we call waldtest(), passing both the restricted and unrestricted regression outputs, and we include our estimate for the variance / covariance robust estimator for the unrestricted model:
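Assuming the unrestricted and restricted fits are stored as lmInteract and lmRestricted (names used here for illustration), the call looks like:

```r
# Heteroskedasticity-robust joint test that all IQ terms are zero
waldtest(lmInteract, lmRestricted,
         vcov = vcovHC(lmInteract, type = "HC1"))
```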

## Wald test
## 
## Model 1: MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + 
##     Tenure + IQ:YearsExperience + IQ:YearsEdu
## Model 2: MonthlyEarnings ~ Knowledge + YearsEdu + YearsExperience + Tenure
##   Res.Df Df      F   Pr(>F)   
## 1    655                      
## 2    658 -3 4.0298 0.007423 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.007 which is below 0.05. We reject the null hypothesis and conclude that there is sufficient statistical evidence that IQ does help explain monthly earnings.

James M. Murray, Ph.D.
University of Wisconsin - La Crosse
R Tutorials for Applied Statistics