PDF file location: http://www.murraylax.org/rtutorials/regression_coefficients.pdf

HTML file location: http://www.murraylax.org/rtutorials/regression_coefficients.html

In this tutorial, we will focus on an example that explores the relationship between total monthly earnings and years of education. We will estimate the following regression equation:

\[ y_i = b_0 + b_1 x_i + e_i \]

where \(y_i\) denotes the *income* of individual \(i\), and \(x_i\) denotes the number of *years of education* of individual \(i\).

The code below downloads a CSV file that includes data from 1980 for 935 individuals on variables including their total monthly earnings (`MonthlyEarnings`) and a number of variables that could influence income, including years of education (`YearsEdu`). The data set originally comes from the textbook website for Stock and Watson’s *Introduction to Econometrics*.

`wages <- read.csv("http://murraylax.org/datasets/wage2.csv")`

We estimate the simple regression with the following call to `lm()` and store the output in an object we call `lmwages`:

`lmwages <- lm(MonthlyEarnings ~ YearsEdu, data=wages)`

We can print a summary of the results with the following call to the `summary()` function:

`summary(lmwages)`

```
##
## Call:
## lm(formula = MonthlyEarnings ~ YearsEdu, data = wages)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -877.38 -268.63  -38.38  207.05 2148.26
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  146.952     77.715   1.891   0.0589 .
## YearsEdu      60.214      5.695  10.573   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 382.3 on 933 degrees of freedom
## Multiple R-squared: 0.107, Adjusted R-squared: 0.106
## F-statistic: 111.8 on 1 and 933 DF, p-value: < 2.2e-16
```

The ‘Estimate’ column of the coefficients table implies that the equation for the best-fitting line is \[\hat{y}_i = 146.95 + 60.21 x_i.\]
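As a quick illustration of how the fitted line can be used, the sketch below plugs two hypothetical education levels (12 and 16 years, chosen only for illustration) into the estimated equation. The coefficients are typed in from the summary output rather than pulled from `lmwages`, so the block runs on its own; the names `b0`, `b1`, and `years` are assumptions introduced here.

```r
# Coefficient estimates copied from the summary output above
b0 <- 146.952   # intercept
b1 <- 60.214    # slope on YearsEdu

# Hypothetical education levels, chosen only for illustration
years <- c(12, 16)

# Predicted monthly earnings from the fitted line
predicted <- b0 + b1 * years
predicted
```

With the fitted model object available, `predict(lmwages, newdata = data.frame(YearsEdu = c(12, 16)))` should give the same values up to rounding.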

The regression coefficient on years of education, \(b_1=60.21\), implies that in our sample each additional year of education is associated with $60.21 higher monthly earnings. Suppose we wanted to test the hypothesis that having more years of education is associated with a change in monthly earnings. The null and alternative hypotheses would be as follows:

\[ H_0: \beta_1 = 0 \] \[ H_A: \beta_1 \neq 0 \]

The null hypothesis says that in the population the slope is equal to zero. This implies that changing the explanatory variable, years of education, is associated with zero change in the outcome variable, monthly earnings.

The alternative hypothesis says that in the population the slope is different from zero. This implies that changing the explanatory variable, years of education, is associated with a change in the outcome variable, monthly earnings.

Results from this hypothesis test are reported in the summary above. The p-values *for a two-tailed test* are in the column labeled `Pr(>|t|)`. In our case, the p-value on `YearsEdu` is less than \(2.2~\times~10^{-16}\) (printed as `<2e-16`). Since this is far below a significance level of 5%, we reject the null hypothesis and conclude that having more years of education is associated with a difference in monthly earnings.
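To make the link between the coefficient table and the test explicit, here is a minimal sketch that recomputes the t statistic and the two-tailed p-value from the reported estimate, standard error, and residual degrees of freedom. The numbers are copied from the summary output, so small rounding differences from R's internal values are expected; the variable names are assumptions introduced for this example.

```r
# Values copied from the YearsEdu row of the summary output
estimate  <- 60.214
std_error <- 5.695
df_resid  <- 933     # residual degrees of freedom

# t statistic for H0: beta_1 = 0
t_stat <- estimate / std_error

# Two-tailed p-value: area in both tails beyond |t|
p_two_tailed <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

t_stat
p_two_tailed
```

The computed p-value is far smaller than what `summary()` displays, because the printed table only reports very small p-values as being below a display threshold.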

Suppose instead we have reason to believe that more education should result in higher average income, so we want to test the hypothesis that having *more education* is associated with *higher monthly earnings*. For this, we will conduct the following one-tailed hypothesis test on the coefficient:

\[ H_0: \beta_1 = 0 \] \[ H_A: \beta_1 > 0 \]

Again, the null hypothesis says that in the population the slope is equal to zero, so years of education is associated with zero change in monthly earnings.

The alternative hypothesis says that in the population the slope is greater than zero. This implies that more years of education is associated with higher average monthly earnings.

We can use the same output from above. If the estimated coefficient is positive, as stated in our alternative hypothesis, we can take the two-tailed p-value from the table and divide it by two, so that it includes only the area in the right tail.

Our p-value is therefore less than \(1.1~\times~10^{-16}\), which is far below the 5% significance level, so we reject the null hypothesis and conclude that having more years of education is associated with higher average monthly earnings.
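The halving rule can be checked directly. The sketch below, again using values copied from the summary output and variable names introduced only for this example, computes the right-tail area for the one-tailed test and compares it with the two-tailed p-value.

```r
# Values copied from the YearsEdu row of the summary output
estimate  <- 60.214
std_error <- 5.695
df_resid  <- 933

t_stat <- estimate / std_error

# Right-tailed p-value for H_A: beta_1 > 0
p_one_tailed <- pt(t_stat, df = df_resid, lower.tail = FALSE)

# Two-tailed p-value, for comparison
p_two_tailed <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

p_one_tailed
p_two_tailed / p_one_tailed   # ratio of the two p-values
```

Note that the shortcut applies only because the estimated slope has the sign stated in the alternative hypothesis; computing the tail area with `pt()` directly avoids having to remember that condition.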

Our sample evidence suggests that a single additional year of education is associated with an additional $60.21 in monthly earnings. A 95% confidence interval gives an interval estimate for the size of this association, based on an estimate of the margin of error due to random sampling. We can compute a 95% confidence interval with the following call to `confint()`:

`confint(lmwages,level=0.95)`

```
##                2.5 %    97.5 %
## (Intercept) -5.56393 299.46881
## YearsEdu    49.03783  71.39074
```

From the row labeled `YearsEdu`, we can say with 95% confidence that one additional year of education is associated with higher average monthly earnings of between $49.04 and $71.39.
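For readers who want to see where these bounds come from, the following sketch rebuilds the interval for the slope by hand as the estimate plus or minus a critical t value times the standard error. The inputs are copied from the summary output, so the result should agree with `confint()` only up to rounding; the variable names are assumptions introduced here.

```r
# Values copied from the YearsEdu row of the summary output
estimate  <- 60.214
std_error <- 5.695
df_resid  <- 933

# Critical value leaving 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Margin of error and the resulting 95% confidence interval
margin_of_error <- t_crit * std_error
ci <- c(lower = estimate - margin_of_error,
        upper = estimate + margin_of_error)
round(ci, 2)
```

Changing `0.975` to `0.995` would give the critical value for a 99% interval, which corresponds to `confint(lmwages, level = 0.99)`.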